Abstract

We study the theoretical advantages of active learning over passive learning. Specifically, we
prove that, in noise-free classifier learning for VC classes, any passive learning algorithm can be
transformed into an active learning algorithm with asymptotically strictly superior label complexity
for all nontrivial target functions and distributions. We further provide a general characterization 
of the magnitudes of these improvements in terms of a novel generalization of the disagreement 
coefficient. We also extend these results to active learning in the presence of label noise, and find 
that even under broad classes of noise distributions, we can typically guarantee strict improvements 
over the known results for passive learning. 

Keywords: Active Learning, Selective Sampling, Sequential Design, Statistical Learning Theory,
PAC Learning, Sample Complexity

1. Introduction and Background

The recent rapid growth in data sources has spawned an equally rapid expansion in the number of
potential applications of machine learning methodologies to extract useful concepts from this data.
However, in many cases, the bottleneck in the application process is the need to obtain accurate 
annotation of the raw data according to the target concept to be learned. For instance, in webpage 
classification, it is straightforward to rapidly collect a large number of webpages, but training an 
accurate classifier typically requires a human expert to examine and label a number of these web- 
pages, which may require significant time and effort. For this reason, it is natural to look for ways 
to reduce the total number of labeled examples required to train an accurate classifier. In the
traditional machine learning protocol, here referred to as passive learning, the examples labeled by the
expert are sampled independently at random, and the emphasis is on designing learning algorithms
that make the most effective use of the number of these labeled examples available. However, it 
is possible to go beyond such methods by altering the protocol itself, allowing the learning algo- 
rithm to sequentially select the examples to be labeled, based on its observations of the labels of 
previously-selected examples; this interactive protocol is referred to as active learning. The objec- 
tive in designing this selection mechanism is to focus the expert's efforts toward labeling only the 
most informative data for the learning process, thus eliminating some degree of redundancy in the 
information content of the labeled examples. 

It is now well-established that active learning can sometimes provide significant practical and 
theoretical advantages over passive learning, in terms of the number of labels required to obtain a 
given accuracy. However, our current understanding of active learning in general is still quite limited
in several respects. First, since we are lacking a complete understanding of the potential
capabilities of active learning, we are not yet sure to what standards we should aspire for active learning
algorithms to meet, and in particular this challenges our ability to characterize how a "good" active
learning algorithm should behave. Second, since we have yet to identify a complete set of general
principles for the design of effective active learning algorithms, in many cases the most effective
known active learning algorithms have problem-specific designs (e.g., designed specifically for
linear separators, or decision trees, etc., under specific assumptions on the data distribution), and it
is not clear what components of their design can be abstracted and transferred to the design of
active learning algorithms for different learning problems (e.g., with different types of classifiers, 
or different data distributions). Finally, we have yet to fully understand the scope of the relative 
benefits of active learning over passive learning, and in particular the conditions under which such 
improvements are achievable, as well as a general characterization of the potential magnitudes of 
these improvements. In the present work, we take steps toward closing this gap in our understanding 
of the capabilities, general principles, and advantages of active learning. 

Additionally, this work has a second theme, motivated by practical concerns. To date, the ma- 
chine learning community has invested decades of research into constructing solid, reliable, and 
well-behaved passive learning algorithms, and into understanding their theoretical properties. We 
might hope that an equivalent amount of effort is not required in order to discover and understand 
effective active learning algorithms. In particular, rather than starting from scratch in the design 
and analysis of active learning algorithms, it seems desirable to leverage this vast knowledge of 
passive learning, to whatever extent possible. For instance, it may be possible to design active 
learning algorithms that inherit certain desirable behaviors or properties of a given passive learning 
algorithm. In this way, we can use a given passive learning algorithm as a reference point, and 
the objective is to design an active learning algorithm with performance guarantees strictly superior 
to those of the passive algorithm. Thus, if the passive learning algorithm has proven effective in 
a variety of common learning problems, then the active learning algorithm should be even better 
for those same learning problems. This approach also has the advantage of immediately supplying 
us with a collection of theoretical guarantees on the performance of the active learning algorithm: 
namely, improved forms of all known guarantees on the performance of the given passive learning 
algorithm. 

Due to its obvious practical advantages, this general line of informal thinking dominates the 
existing literature on empirically-tested heuristic approaches to active learning, as most of the pub- 
lished heuristic active learning algorithms make use of a passive learning algorithm as a subroutine 
(e.g., SVM, logistic regression, k-NN, etc.), constructing sets of labeled examples and feeding them 
into the passive learning algorithm at various times during the execution of the active learning
algorithm (see the references in Section 7). Below, we take a more rigorous look at this general strategy.
We develop a reduction-style framework for studying this approach to the design of active learning
algorithms relative to a given passive learning algorithm. We then proceed to develop and analyze a
variety of such methods, to realize this approach in a very general sense. 

Specifically, we explore the following fundamental questions. 

• Is there a general procedure that, given any passive learning algorithm, transforms it into an 
active learning algorithm requiring significantly fewer labels to achieve a given accuracy? 

• If so, how large is the reduction in the number of labels required by the resulting active
learning algorithm, compared to the number of labels required by the original passive algorithm?

• What are sufficient conditions for an exponential reduction in the number of labels required?

• To what extent can these methods be made robust to imperfect or noisy labels? 

In the process of exploring these questions, we find that for many interesting learning problems, the 
techniques in the existing literature are not capable of realizing the full potential of active learn- 
ing. Thus, exploring this topic in generality requires us to develop novel insights and entirely new 
techniques for the design of active learning algorithms. We also develop corresponding natural 
complexity quantities to characterize the performance of such algorithms. Several of the results we 
establish here are more general than any related results in the existing literature, and in many cases 
the algorithms we develop use significantly fewer labels than any previously published methods. 



1.1 Background 



The term active learning refers to a family of supervised learning protocols, characterized by the 
ability of the learning algorithm to pose queries to a teacher, who has access to the target concept 
to be learned. In practice, the teacher and queries may take a variety of forms: a human expert, 
in which case the queries may be questions or annotation tasks; nature, in which case the queries 
may be scientific experiments; a computer simulation, in which case the queries may be particular
parameter values or initial conditions for the simulator; or a host of other possibilities. In our
present context, we will specifically discuss a protocol known as pool-based active learning, a type
of sequential design based on a collection of unlabeled examples; this seems to be the most common
form of active learning in practical use today (e.g., Settles, 2010; Baldridge and Palmer, 2009;
Gangadharaiah, Brown, and Carbonell, 2009; Hoi, Jin, Zhu, and Lyu, 2006; Luo, Kramer, Goldgof,
Hall, Samson, Remsen, and Hopkins, 2005; Roy and McCallum, 2001; Tong and Koller, 2001;
McCallum and Nigam, 1998). We will not discuss alternative models of active learning, such as online
(Dekel, Gentile, and Sridharan, 2010) or exact (Hegedus, 1995). In the pool-based active learning
setting, the learning algorithm is supplied with a large collection of unlabeled examples (the pool),
and is allowed to select any example from the pool to request that it be labeled. After observing 
the label of this example, the algorithm can then select another unlabeled example from the pool 
to request that it be labeled. This continues sequentially for a number of rounds until some halt- 
ing condition is satisfied, at which time the algorithm returns a function intended to approximately 
mimic and generalize the observed labeling behavior. This setting contrasts with passive learning, 
in which the learning algorithm is supplied with a collection of labeled examples. 
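
The pool-based protocol described above can be sketched as a short loop. This is only an illustrative sketch under stated assumptions: the names `select`, `fit`, and `oracle`, and the use of a fixed label `budget` as the halting condition, are hypothetical choices introduced here and do not come from the text.

```python
def pool_based_active_learning(pool, oracle, select, fit, budget):
    """Generic pool-based active learning loop (illustrative sketch).

    pool   : iterable of unlabeled examples
    oracle : function example -> label (the teacher; each call is one label request)
    select : function (unlabeled, labeled) -> next example to query (hypothetical)
    fit    : function labeled_pairs -> classifier (e.g., a passive learner)
    budget : maximum number of label requests (assumed halting condition)
    """
    labeled = []                 # (example, label) pairs requested so far
    unlabeled = list(pool)
    while unlabeled and len(labeled) < budget:
        # The choice of the next query may depend on all previously observed labels.
        x = select(unlabeled, labeled)
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))
    return fit(labeled), labeled
```

A passive learner corresponds to the special case in which `select` ignores `labeled` and picks examples at random.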

Supposing the labels received agree with some true target concept, the objective is to use this 
returned function to approximate the true target concept on future (previously unobserved) data 
points. The hope is that, by carefully selecting which examples should be labeled, the algorithm can 
achieve improved accuracy while using fewer labels compared to passive learning. The motivation 
for this setting is simple. For many modern machine learning problems, unlabeled examples are 
inexpensive and available in abundance, while annotation is time-consuming or expensive. For in- 
stance, this is the case in the aforementioned webpage classification problem, where the pool would 
be the set of all webpages, and labeling a webpage requires a human expert to examine the website
content. Settles (2010) surveys a variety of other applications for which active learning is presently
being used. To simplify the discussion, in this work we focus specifically on binary classification, in 
which there are only two possible labels. The results generalize naturally to multiclass classification 
as well.

As the above description indicates, when studying the advantages of active learning, we are
primarily interested in the number of label requests sufficient to achieve a given accuracy, a quantity
referred to as the label complexity (Definition 1 below). Although active learning has been an active
topic in the machine learning literature for many years now, our theoretical understanding of this
topic was largely lacking until very recently. However, within the past few years, there has been an 
explosion of progress. These advances can be grouped into two categories: namely, the realizable 
case and the agnostic case. 



1.1.1 The Realizable Case 



In the realizable case, we are interested in a particularly strict scenario, where the true label of
any example is determined by a function of the features (covariates), and where that function has 
a specific known form (e.g., linear separator, decision tree, union of intervals, etc.); the set of 
classifiers having this known form is referred to as the concept space. The natural formalization 
of the realizable case is very much analogous to the well-known PAC model for passive learning
(Valiant, 1984). In the realizable case, there are obvious examples of learning problems where
active learning can provide a significant advantage compared to passive learning; for instance, in
the problem of learning threshold classifiers on the real line (Example 1 below), a kind of binary
search strategy for selecting which examples to request labels for naturally leads to exponential 
improvements in label complexity compared to learning from random labeled examples (passive 
learning). As such, there is a natural attraction to determine how general this phenomenon is. 
This leads us to think about general-purpose learning strategies (i.e., which can be instantiated for 
more than merely threshold classifiers on the real line), which exhibit this binary search behavior in 
various special cases. 
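
The binary search strategy for threshold classifiers can be made concrete as follows. This is a minimal sketch, assuming noise-free labels and thresholds of the form h_t(x) = +1 if x >= t and -1 otherwise; the function name `active_threshold` and the callback `query_label` are hypothetical names introduced for illustration.

```python
def active_threshold(pool, query_label):
    """Find a threshold consistent with every label in the pool by binary search.

    Assumes the realizable case: there is some t with label(x) = +1 iff x >= t.
    Returns (threshold, number_of_label_requests).
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)          # invariant: xs[:lo] are all -1, xs[hi:] are all +1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1             # one label request per iteration
        if query_label(xs[mid]) == 1:
            hi = mid             # first positive point is at index <= mid
        else:
            lo = mid + 1         # first positive point is at index > mid
    # xs[lo] is the smallest positively labeled point (if any exists).
    threshold = xs[lo] if lo < len(xs) else float("inf")
    return threshold, queries
```

For a pool of n points this requests roughly log2(n) labels, whereas a passive learner would need labels for a constant fraction of the pool to localize the boundary equally well.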

The first such general-purpose strategy to emerge in the literature was a particularly elegant
strategy proposed by Cohn, Atlas, and Ladner (1994), typically referred to as CAL after its
discoverers (Meta-Algorithm 2 below). The strategy behind CAL is the following. The algorithm
examines each example in the unlabeled pool in sequence, and if there are two classifiers in the 
concept space consistent with all previously-observed labels, but which disagree on the label of this 
next example, then the algorithm requests that label, and otherwise it does not. For this reason, be- 
low we refer to the general family of algorithms inspired by CAL as disagreement-based methods. 
Disagreement-based methods are sometimes referred to as "mellow" active learning, since in some
sense this is the least we can expect from a reasonable active learning algorithm; it never requests 
the label of an example whose label it can infer from information already available, but otherwise 
makes no attempt to seek out particularly informative examples to request the labels of. That is, the 
notion of informativeness implicit in disagreement-based methods is a binary one, so that an exam- 
ple is either informative or not informative, but there is no further ranking of the informativeness 
of examples. The disagreement-based strategy is quite general, and obviously leads to algorithms
that are at least reasonable, but Cohn, Atlas, and Ladner (1994) did not study the label complexity
achieved by their strategy in any generality. 
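
To illustrate the disagreement-based rule, the following sketch instantiates it for threshold classifiers h_t(x) = +1 if x >= t and -1 otherwise, where the set of classifiers consistent with the observed labels can be tracked with two numbers. This is an illustration under those assumptions, not the general meta-algorithm; the names `cal_thresholds` and `query_label` are hypothetical.

```python
def cal_thresholds(stream, query_label):
    """Disagreement-based (CAL-style) label requests for threshold classifiers.

    Assumes noise-free labels from some threshold t. The consistent thresholds
    are exactly those in (lo, hi], where lo is the largest point observed with
    label -1 and hi is the smallest point observed with label +1.
    Returns ((lo, hi), number_of_label_requests).
    """
    lo, hi = float("-inf"), float("inf")
    queries = 0
    for x in stream:
        if lo < x < hi:
            # Two consistent thresholds disagree on x, so request its label.
            queries += 1
            if query_label(x) == 1:
                hi = x
            else:
                lo = x
        # Otherwise every consistent threshold agrees on x: its label is
        # inferred from earlier labels and no request is made.
    return (lo, hi), queries
```

Note that the rule is binary, as described above: a point strictly between `lo` and `hi` is queried regardless of how close it lies to either end.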



In a Bayesian variant of the realizable setting, Freund, Seung, Shamir, and Tishby (1997) studied
an algorithm known as Query by Committee (QBC), which in some sense represents a Bayesian 
variant of CAL. However, QBC does distinguish between different levels of informativeness beyond 
simple disagreement, based on the amount of disagreement on a random unlabeled example. They 
were able to analyze the label complexity achieved by QBC in terms of a type of information gain,
and found that when the information gain is lower bounded by a positive constant, the algorithm
achieves a label complexity exponentially smaller than the known results for passive learning. In
particular, this is the case for the threshold learning problem, and also for the problem of learning
higher-dimensional (nearly balanced) linear separators when the data satisfy a certain (uniform)
distribution. Below, we will not discuss this analysis further, since it is for a slightly different
(Bayesian) setting. However, the results below in our present setting do have interesting implications
for the Bayesian setting as well, as discussed in the recent work of Yang, Hanneke, and Carbonell
(2011).
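
As a rough illustration of the committee idea, the sketch below applies a two-member committee to threshold classifiers h_t(x) = +1 if x >= t. It assumes a uniform prior on t over [0, 1] and noise-free labels, so that the posterior is uniform on the interval of consistent thresholds; these modeling choices, and the names `qbc_thresholds` and `query_label`, are assumptions made here for illustration rather than details taken from the cited work.

```python
import random

def qbc_thresholds(stream, query_label, rng=None):
    """Two-member Query-by-Committee sketch for thresholds on [0, 1].

    Assumes a uniform prior on the target threshold and noise-free labels,
    so the posterior is uniform on the consistent interval (lo, hi].
    Returns ((lo, hi), number_of_label_requests).
    """
    rng = rng or random.Random()
    lo, hi = 0.0, 1.0
    queries = 0
    for x in stream:
        # Draw a committee of two thresholds from the current posterior.
        t1, t2 = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if (x >= t1) != (x >= t2):       # the two members disagree on x
            queries += 1
            if query_label(x) == 1:
                hi = x                   # target threshold is at most x
            else:
                lo = x                   # target threshold is above x
    return (lo, hi), queries
```

In this sketch the probability of requesting a label grows with the posterior mass of thresholds that disagree on the point, so points near the middle of the consistent interval are queried more often than points near its ends.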

The first general analysis of the label complexity of active learning in the (non-Bayesian)
realizable case came in the breakthrough work of Dasgupta (2005). In that work, Dasgupta proposed a
quantity, called the splitting index, to characterize the label complexities achievable by active learn- 
ing. The splitting index analysis is noteworthy for several reasons. First, one can show it provides 
nearly tight bounds on the minimax label complexity for a given concept space and data distribution. 
In particular, the analysis matches the exponential improvements known to be possible for threshold 
classifiers, as well as generalizations to higher-dimensional homogeneous linear separators under
near-uniform distributions (as first established by Dasgupta, Kalai, and Monteleoni (2005, 2009)).
Second, it provides a novel notion of informativeness of an example, beyond the simple binary 
notion of informativeness employed in disagreement-based methods. Specifically, it describes the 
informativeness of an example in terms of the number of pairs of well-separated classifiers for 
which at least one out of each pair will definitely be contradicted, regardless of the example's label. 
Finally, unlike any other existing work on active learning (present work included), it provides an el- 
egant description of the trade-off between the number of label requests and the number of unlabeled 
examples needed by the learning algorithm. Another interesting byproduct of Dasgupta's work is a 
better understanding of the nature of the improvements achievable by active learning in the general 
case. In particular, his work clearly illustrates the need to study the label complexity as a quantity 
that varies depending on the particular target concept and data distribution. We will see this issue 
arise in many of the examples below. 

Coming from a slightly different perspective, Hanneke (2007a) later analyzed the label
complexity of active learning in terms of an extension of the teaching dimension (Goldman and Kearns,
1995). Related quantities were previously used by Hegedus (1995) and Hellerstein,
Pillaipakkamnatt, Raghavan, and Wilkins (1996) to tightly characterize the number of membership queries
sufficient for Exact learning; Hanneke (2007a) provided a natural generalization to the PAC learning
setting. At this time, it is not clear how this quantity relates to the splitting index. From a practical
perspective, in some instances it may be easier to calculate (see the work of Nowak (2008) for a
discussion related to this), though in other cases the opposite seems true.

The next progress toward understanding the label complexity of active learning came in the work
of Hanneke (2007b), who introduced a quantity called the disagreement coefficient (Definition 9
below), accompanied by a technique for analyzing disagreement-based active learning algorithms. In
particular, implicit in that work, and made explicit in the later work of Hanneke (2011), was the
first general characterization of the label complexities achieved by the original CAL strategy for 
active learning in the realizable case, stated in terms of the disagreement coefficient. The results of 
the present work are direct descendants of that 2007 paper, and we will discuss the disagreement
coefficient, and results based on it, in substantial detail below. Disagreement-based active learners 
such as CAL are known to be sometimes suboptimal relative to the splitting index analysis, and 
therefore the disagreement coefficient analysis sometimes results in larger label complexity bounds
than the splitting index analysis. However, in many cases the label complexity bounds based on
the disagreement coefficient are surprisingly good considering the simplicity of the methods. Fur- 
thermore, as we will see below, the disagreement coefficient has the practical benefit of often being 
fairly straightforward to calculate for a variety of learning problems, particularly when there is a 
natural geometric interpretation of the classifiers and the data distribution is relatively smooth. As 
we discuss below, it can also be used to bound the label complexity of active learning in noisy 
settings. For these reasons (simplicity of algorithms, ease of calculation, and applicability beyond 
the realizable case), subsequent work on the label complexity of active learning has tended to favor 
the disagreement-based approach, making use of the disagreement coefficient to bound the label
complexity (Dasgupta, Hsu, and Monteleoni, 2007; Friedman, 2009; Beygelzimer, Dasgupta, and
Langford, 2009; Wang, 2009; Balcan, Hanneke, and Vaughan, 2010; Hanneke, 2011; Koltchinskii,
2010; Beygelzimer, Hsu, Langford, and Zhang, 2010; Mahalanabis, 2011; Wang, 2011). A significant
part of the present paper focuses on extending and generalizing the disagreement coefficient
analysis, while still maintaining the relative ease of calculation that makes the disagreement
coefficient so useful.

In addition to many positive results, Dasgupta (2005) also pointed out several negative results,
even for very simple and natural learning problems. In particular, for many problems, the minimax
label complexity of active learning will be no better than that of passive learning. In fact, Balcan,
Hanneke, and Vaughan (2010) later showed that, for a certain type of active learning algorithm -
namely, self-verifying algorithms, which themselves adaptively determine how many label requests
they need to achieve a given accuracy - there are even particular target concepts and data
distributions for which no active learning algorithm of that type can outperform passive learning. Since all
of the above label complexity analyses (splitting index, teaching dimension, disagreement coeffi- 
cient) apply to certain respective self-verifying learning algorithms, these negative results are also 
reflected in all of the existing general label complexity analyses as well. 

While at first these negative results may seem discouraging, Balcan, Hanneke, and Vaughan
(2010) noted that if we do not require the algorithm to be self-verifying, instead simply measuring
the number of label requests the algorithm needs to find a good classifier, rather than the number 
needed to both find a good classifier and verify that it is indeed good, then these negative results 
vanish. In fact, (shockingly) they were able to show that for any concept space with finite VC 
dimension, and any fixed data distribution, for any given passive learning algorithm there is an 
active learning algorithm with asymptotically superior label complexity for every nontrivial target 
concept! A positive result of this generality and strength is certainly an exciting advance in our 
understanding of the advantages of active learning. But perhaps equally exciting are the unresolved 
questions raised by that work, as there are potential opportunities to strengthen, generalize, simplify, 
and elaborate on this result. First, note that the above statement allows the active learning algorithm 
to be specialized to the particular distribution according to which the (unlabeled) data are sampled,
and indeed the active learning method used by Balcan, Hanneke, and Vaughan (2010) in their proof
has a rather strong direct dependence on the data distribution (which cannot be removed by simply 
replacing some calculations with data-dependent estimators). One interesting question is whether 
an alternative approach might avoid this direct distribution-dependence in the algorithm, so that 
the claim can be strengthened to say that the active algorithm is superior to the passive algorithm 
for all nontrivial target concepts and data distributions. This question is interesting both theoreti- 
cally, in order to obtain the strongest possible theorem on the advantages of active learning, as well 
as practically, since direct access to the distribution from which the data are sampled is typically
not available in practical learning scenarios. A second question left open by Balcan, Hanneke, and
Vaughan (2010) regards the magnitude of the gap between the active and passive label complexities.
Specifically, although they did find particularly nasty learning problems where the label complexity 
of active learning will be close to that of passive learning (though always better), they hypothesized 
that for most natural learning problems, the improvements over passive learning should typically 
be exponentially large (as is the case for threshold classifiers); they gave many examples to illus- 
trate this point, but left open the problem of characterizing general sufficient conditions for these 
exponential improvements to be achievable, even when they are not achievable by self-verifying
algorithms. Another question left unresolved by Balcan, Hanneke, and Vaughan (2010) is whether
this type of general improvement guarantee might be realized by a computationally efficient active
learning algorithm. Finally, they left open the question of whether such general results might be
further generalized to settings that involve noisy labels. The present work picks up where Balcan,
Hanneke, and Vaughan (2010) left off in several respects, making progress on each of the above
questions, in some cases completely resolving the question. 



1.1.2 The Agnostic Case 



In addition to the above advances in our understanding of active learning in the realizable case, there 
has also been wonderful progress in making these methods robust to imperfect teachers, feature 
space underspecification, and model misspecification. This general topic goes by the name agnostic
active learning, from its roots in the agnostic PAC model (Kearns, Schapire, and Sellie, 1994). In
contrast to the realizable case, in the agnostic case, there is not necessarily a perfect classifier of a 
known form, and indeed there may even be label noise so that there is no perfect classifier of any 
form. Rather, we have a given set of classifiers (e.g., linear separators, or depth-limited decision 
trees, etc.), and the objective is to identify a classifier whose accuracy is not much worse than the 
best classifier of that type. Agnostic learning is strictly more general, and often more difficult, than 
realizable learning; this is true for both passive learning and active learning. However, for a given
agnostic learning problem, we might still hope that active learning can achieve a given accuracy 
using fewer labels than required for passive learning. 
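
The agnostic objective described above can be written compactly. The notation is an assumption made here for illustration, since none of these symbols is introduced in this part of the text: $P$ denotes the joint distribution of examples and labels, $\mathbb{C}$ the given set of classifiers, $\hat{h}$ the classifier returned by the learner, and $\varepsilon$ the allowed excess error.

```latex
\mathrm{er}(h) = P\big( (x,y) : h(x) \neq y \big),
\qquad
\mathrm{er}(\hat{h}) - \inf_{h \in \mathbb{C}} \mathrm{er}(h) \le \varepsilon .
```

In the realizable case some classifier in $\mathbb{C}$ has zero error, so the infimum vanishes and the requirement reduces to $\mathrm{er}(\hat{h}) \le \varepsilon$.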

The general topic of agnostic active learning got its first taste of real progress from Balcan,
Beygelzimer, and Langford (2006a, 2009) with the publication of the A² (agnostic active)
algorithm. This method is a noise-robust disagreement-based algorithm, which can be applied with
essentially arbitrary types of classifiers under arbitrary noise distributions. It is interesting both for
its effectiveness and (as with CAL) its elegance. The original work of Balcan, Beygelzimer, and
Langford (2006a, 2009) showed that, in some special cases (thresholds, and homogeneous linear
separators under a uniform distribution), the A² algorithm does achieve improved label
complexities compared to the known results for passive learning.



Using a different type of general active learning strategy, Hanneke (2007a) found that the
teaching dimension analysis (discussed above for the realizable case) can be extended beyond the
realizable case, arriving at general bounds on the label complexity under arbitrary noise distributions.
These bounds improve over the known results for passive learning in many cases. However, the
algorithm requires direct access to a certain quantity that depends on the noise distribution (namely,
the noise rate, defined in Section 6 below), which would not be available in many real-world learning
problems.

Later, Hanneke (2007b) established a general characterization of the label complexities achieved
by A², expressed in terms of the disagreement coefficient. The result holds for arbitrary types of
classifiers (of finite VC dimension) and arbitrary noise distributions, and represents the natural gen- 
eralization of the aforementioned realizable-case analysis of CAL. In many cases, this result shows 
improvements over the known results for passive learning. Furthermore, because of the simplicity of 
the disagreement coefficient, the bound can be calculated for a variety of natural learning problems. 

Soon after this, Dasgupta, Hsu, and Monteleoni (2007) proposed a new active learning
strategy, which is also effective in the agnostic setting. Like A², the new algorithm is a noise-robust
disagreement-based method. The work of Dasgupta, Hsu, and Monteleoni (2007) is significant for
at least two reasons. First, they were able to establish a general label complexity bound for this
method based on the disagreement coefficient. The bound is similar in form to the previous label
complexity bound for A² by Hanneke (2007b), but improves the dependence of the bound on the
disagreement coefficient. Second, the proposed method of Dasgupta, Hsu, and Monteleoni (2007)
set a new standard for computational and aesthetic simplicity in agnostic active learning algorithms.
This work has since been followed by related methods of Beygelzimer, Dasgupta, and Langford
(2009) and Beygelzimer, Hsu, Langford, and Zhang (2010). In particular, Beygelzimer, Dasgupta,
and Langford (2009) develop a method capable of learning under an essentially arbitrary loss
function; they also show label complexity bounds similar to those of Dasgupta, Hsu, and Monteleoni
(2007), but applicable to a larger class of loss functions, and stated in terms of a generalization of
the disagreement coefficient for arbitrary loss functions. 

While the above results are encouraging, the guarantees reflected in these label complexity 
bounds essentially take the form of (at best) constant factor improvements; specifically, in some 
cases the bounds improve the dependence on the noise rate factor (defined in Section 6 below),
compared to the known results for passive learning. In fact, Kääriäinen showed that any
label complexity bound depending on the noise distribution only via the noise rate cannot do better
than this type of constant-factor improvement. This raised the question of whether, with a more
detailed description of the noise distribution, one can show improvements in the asymptotic form of the
label complexity compared to passive learning. Toward this end, Castro and Nowak (2008) studied
a certain refined description of the noise conditions, related to the margin conditions of Mammen
and Tsybakov (1999), which are well-studied in the passive learning literature. Specifically, they
found that in some special cases, under certain restrictions on the noise distribution, the asymptotic 
form of the label complexity can be improved compared to passive learning, and in some cases the 
improvements can even be exponential in magnitude; to achieve this, they developed algorithms 
specifically tailored to the types of classifiers they studied (threshold classifiers and boundary
fragment classes). Balcan, Broder, and Zhang (2007) later extended this result to general homogeneous
linear separators under a uniform distribution. Following this, Hanneke (2009a, 2011) generalized
these results, showing that both of the published general agnostic active learning algorithms
(Balcan, Beygelzimer, and Langford, 2009; Dasgupta, Hsu, and Monteleoni, 2007) can also achieve
these types of improvements in the asymptotic form of the label complexity; he further proved
general bounds on the label complexities of these methods, again based on the disagreement coefficient,
which apply to arbitrary types of classifiers, and which reflect these types of improvements (under
conditions on the disagreement coefficient). Wang (2009) later bounded the label complexity of
A² under somewhat different noise conditions, in particular identifying weaker noise conditions
sufficient for these improvements to be exponential in magnitude (again, under conditions on the
disagreement coefficient). Koltchinskii (2010) has recently improved on some of Hanneke's results,
refining certain logarithmic factors and simplifying the proofs, using a slightly different algorithm
based on similar principles. Though the present work discusses only classes of finite VC dimen- 
sion, most of the above references also contain results for various types of nonparametric classes 
with infinite VC dimension. 

At present, all of the published bounds on the label complexity of agnostic active learning also 
apply to self-verifying algorithms. As mentioned, in the realizable case, it is typically possible to 
achieve significantly better label complexities if we do not require the active learning algorithm to 
be self-verifying, since the verification of learning may be more difficult than the learning itself 
(Balcan, Hanneke, and Vaughan, 2010). We might wonder whether this is also true in the agnostic
case, and whether agnostic active learning algorithms that are not self-verifying might possibly
achieve significantly better label complexities than the existing label complexity bounds described 
above. We investigate this in depth below. 



1.2 Summary of Contributions 

In the present work, we build on and extend the above results in a variety of ways, resolving a 
number of open problems. The main contributions of this work can be summarized as follows. 



• We formally define a notion of a universal activizer, a meta-algorithm that transforms any pas- 
sive learning algorithm into an active learning algorithm with asymptotically strictly superior 
label complexities for all nontrivial target concepts and distributions. 

• We analyze the existing strategy of disagreement-based active learning from this perspec- 
tive, precisely characterizing the conditions under which this strategy can lead to a universal 
activizer in the realizable case. 



• We propose a new type of active learning algorithm, based on shatterable sets, and prove that 
we can construct universal activizers for the realizable case based on this idea; in particular, 
this overcomes the issue of distribution-dependence in the existing results mentioned above. 

• We present a novel generalization of the disagreement coefficient, along with a new asymp- 
totic bound on the label complexities achievable by active learning in the realizable case; this 
new bound is often significantly smaller than the existing results in the published literature. 

• We state new concise sufficient conditions for exponential improvements over passive learn- 
ing to be achievable in the realizable case, including a significant weakening of known con- 
ditions in the published literature. 

• We present a new general-purpose active learning algorithm for the agnostic case, based on 
the aforementioned idea involving shatterable sets. 

• We prove a new asymptotic bound on the label complexities achievable by active learning in 
the presence of label noise (the agnostic case), often significantly smaller than any previously 
published results. 

• We formulate a general conjecture on the theoretical advantages of active learning over
passive learning in the presence of arbitrary types of label noise.

1.3 Outline of the Paper

The paper is organized as follows. In Section 2, we introduce the basic notation used throughout,
formally define the learning protocol, and formally define the label complexity. We also define the
notion of an activizer, which is a procedure that transforms a passive learning algorithm into an
active learning algorithm with asymptotically superior label complexity. In Section 3, we review
the established technique of disagreement-based active learning, and prove a new result precisely
characterizing the scenarios in which disagreement-based active learning can be used to construct
an activizer. In particular, we find that in many scenarios, disagreement-based active learning is not
powerful enough to provide the desired improvements. In Section 4, we move beyond
disagreement-based active learning, developing a new type of active learning algorithm based on
shatterable sets of points. We apply this technique to construct a simple 3-stage procedure, which we
then prove is a universal activizer for any concept space of finite VC dimension. In Section 5, we
begin by reviewing the known results for bounding the label complexity of disagreement-based active
learning in terms of the disagreement coefficient; we then develop a somewhat more involved procedure,
again based on shatterable sets, which takes full advantage of the sequential nature of active learning.
In addition to being an activizer, we show that this procedure often achieves label complexities
dramatically superior to those achievable by passive learning. In particular, we define a novel
generalization of the disagreement coefficient, and use it to bound the label complexity of this
procedure. This also provides us with concise sufficient conditions for obtaining exponential
improvements over passive learning. Continuing in Section 6, we extend our framework to allow for
label noise (the agnostic case), and discuss the possibility of extending the results from previous
sections to these noisy learning problems. We first review the known results for noise-robust
disagreement-based active learning, and characterizations of its label complexity in terms of the
disagreement coefficient and Mammen-Tsybakov noise parameters. We then proceed to develop a new type
of noise-robust active learning algorithm, again based on shatterable sets, and prove bounds on its
label complexity in terms of our aforementioned generalization of the disagreement coefficient.
Additionally, we present a general conjecture concerning the existence of activizers for certain
passive learning algorithms in the agnostic case. We conclude in Section 7 with a host of enticing
open problems for future investigation.



2. Definitions and Notation 

For most of the paper, we consider the following formal setting. There is a measurable space
$(\mathcal{X}, \mathcal{F}_{\mathcal{X}})$, where $\mathcal{X}$ is called the instance space; for
simplicity, we suppose this is a standard Borel space (Srivastava, 1998) (e.g., under the usual Borel
$\sigma$-algebra), though most of the results generalize. A classifier is any measurable function
$h : \mathcal{X} \to \{-1,+1\}$. There is a set $\mathbb{C}$ of classifiers called the concept space.
In the realizable case, the learning problem is characterized as follows. There is a probability
measure $\mathcal{P}$ on $\mathcal{X}$, and a sequence $\mathcal{Z}_X = \{X_1, X_2, \ldots\}$ of
independent $\mathcal{X}$-valued random variables, each with distribution $\mathcal{P}$. We refer to
these random variables as the sequence of unlabeled examples; although in practice this sequence would
typically be large but finite, to simplify the discussion and focus strictly on counting labels, we
will suppose this sequence is inexhaustible. There is additionally a special element
$f \in \mathbb{C}$, called the target function, and we denote $Y_i = f(X_i)$; we further denote by
$\mathcal{Z} = \{(X_1,Y_1), (X_2,Y_2), \ldots\}$ the sequence of labeled examples, and for
$m \in \mathbb{N}$ we denote by $\mathcal{Z}_m = \{(X_1,Y_1), (X_2,Y_2), \ldots, (X_m,Y_m)\}$ the
finite subsequence consisting of the first $m$ elements of $\mathcal{Z}$. For any classifier $h$, we
define the error rate $\mathrm{er}(h) = \mathcal{P}(x : h(x) \neq f(x))$. Informally, the learning
objective in the realizable case is to identify some $h$ with small $\mathrm{er}(h)$ using elements
from $\mathcal{Z}$, without direct access to $f$.

An active learning algorithm $\mathcal{A}$ is permitted direct access to the $\mathcal{Z}_X$ sequence
(the unlabeled examples), but to gain access to the $Y_i$ values it must request them one at a time,
in a sequential manner. Specifically, given access to the $\mathcal{Z}_X$ values, the algorithm selects
any index $i \in \mathbb{N}$, requests to observe the $Y_i$ value, then having observed the value of
$Y_i$, selects another index $i'$, observes the value of $Y_{i'}$, etc. The algorithm is given as input
an integer $n$, called the label budget, and is permitted to observe at most $n$ labels total before
eventually halting and returning a classifier $\hat{h}_n = \mathcal{A}(n)$; that is, by definition, an
active learning algorithm never attempts to access more than the given budget of $n$ labels. We will
then study the values of $n$ sufficient to guarantee $\mathbb{E}[\mathrm{er}(\hat{h}_n)] \le \epsilon$,
for any given value $\epsilon \in (0,1)$. We refer to this as the label complexity. We will be
particularly interested in the asymptotic dependence on $\epsilon$ in the label complexity, as
$\epsilon \to 0$. Formally, we have the following definition.

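Definition 1 An active learning algorithm $\mathcal{A}$ achieves label complexity
$\Lambda(\cdot,\cdot,\cdot)$ if, for every target function $f$, distribution $\mathcal{P}$,
$\epsilon \in (0,1)$, and integer $n \ge \Lambda(\epsilon, f, \mathcal{P})$, we have
$\mathbb{E}[\mathrm{er}(\mathcal{A}(n))] \le \epsilon$. $\diamond$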
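
The label-request protocol behind Definition 1 can be illustrated with a small sketch. The Python
fragment below is only an illustration under stated assumptions: the names `LabelOracle`, `request`,
and `BudgetExceeded` are hypothetical (they do not come from the paper), the inexhaustible unlabeled
sequence is truncated to a finite list, and instances are taken to be real numbers.

```python
from typing import Callable, List


class BudgetExceeded(RuntimeError):
    """Raised when an algorithm asks for more than n labels."""


class LabelOracle:
    """Hypothetical wrapper: free access to X_i, metered access to Y_i = f(X_i)."""

    def __init__(self, xs: List[float], target: Callable[[float], int], budget: int):
        self.xs = xs            # unlabeled examples, freely visible to the learner
        self._target = target   # target function f, hidden from the learner
        self.budget = budget    # label budget n
        self.used = 0           # number of labels requested so far

    def request(self, i: int) -> int:
        """Return Y_i for the chosen index i, counting it against the budget."""
        if self.used >= self.budget:
            raise BudgetExceeded(f"budget of {self.budget} labels exhausted")
        self.used += 1
        return self._target(self.xs[i])


# Usage: the learner picks indices one at a time, possibly adaptively.
oracle = LabelOracle([0.1, 0.4, 0.8], lambda x: 1 if x >= 0.5 else -1, budget=2)
labels = [oracle.request(2), oracle.request(0)]
```

In this sketch the budget is enforced by the oracle rather than by the learner; the definition above
instead treats the budget as a property of the algorithm itself, so this is a design convenience
rather than part of the formal setting.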

This definition of label complexity is similar to one originally studied by Balcan, Hanneke, and
Vaughan (2010). It has a few features worth noting. First, the label complexity has an explicit
dependence on the target function $f$ and distribution $\mathcal{P}$. As noted by Dasgupta (2005), we
need this dependence if we are to fully understand the range of label complexities achievable by active
learning; we further illustrate this issue in the examples below. The second feature to note is that
the label complexity, as defined here, is simply a sufficient budget size to achieve the specified
accuracy. That is, here we are asking only how many label requests are required for the algorithm
to achieve a given accuracy (in expectation). However, as noted by Balcan, Hanneke, and Vaughan
(2010), this number might not be sufficiently large to detect that the algorithm has indeed achieved
the required accuracy based only on the observed data. That is, because the number of labeled
examples used in active learning can be quite small, we come across the problem that the number
of labels needed to learn a concept might be significantly smaller than the number of labels needed
to verify that we have successfully learned the concept. As such, this notion of label complexity
is most useful in the design of effective learning algorithms, rather than for predicting the number
of labels an algorithm should request in any particular application. Specifically, to design effective
active learning algorithms, we should generally desire small label complexity values, so that (in the
extreme case) if some algorithm $\mathcal{A}$ has smaller label complexity values than some other
algorithm $\mathcal{A}'$ for all target functions and distributions, then (all other factors being
equal) we should clearly prefer algorithm $\mathcal{A}$ over algorithm $\mathcal{A}'$; this is true
regardless of whether we have a means to detect (verify) how large the improvements offered by
algorithm $\mathcal{A}$ over algorithm $\mathcal{A}'$ are for any particular application. Thus, in our
present context, this notion of label complexity plays a role analogous to concepts such as universal
consistency or admissibility, which are also generally useful in guiding the design of effective
algorithms, but are not intended to be informative in the context of any particular application. See
the work of Balcan, Hanneke, and Vaughan (2010) for a discussion of this issue, as it relates to a
definition of label complexity similar to that above, as well as other notions of label complexity
from the active learning literature (some of which include a verification requirement).

We will be interested in the performance of active learning algorithms, relative to the performance
of a given passive learning algorithm. In this context, a passive learning algorithm $\mathcal{A}$
takes as input a finite sequence of labeled examples
$\mathcal{L} \in \bigcup_n (\mathcal{X} \times \{-1,+1\})^n$, and returns a classifier
$\hat{h} = \mathcal{A}(\mathcal{L})$. We allow both active and passive learning algorithms to be
randomized: that is, to have internal randomness, in addition to the given random data. We define the
label complexity for a passive learning algorithm as follows.

Definition 2 A passive learning algorithm $\mathcal{A}$ achieves label complexity
$\Lambda(\cdot,\cdot,\cdot)$ if, for every target function $f$, distribution $\mathcal{P}$,
$\epsilon \in (0,1)$, and integer $n \ge \Lambda(\epsilon, f, \mathcal{P})$, we have
$\mathbb{E}[\mathrm{er}(\mathcal{A}(\mathcal{Z}_n))] \le \epsilon$. $\diamond$

Although technically some algorithms may be able to achieve a desired accuracy without any
observations, to make the general results easier to state (namely, those in Section 5), unless
otherwise stated we suppose label complexities (both passive and active) take strictly positive values,
among $\mathbb{N} \cup \{\infty\}$; note that label complexities (both passive and active) can be
infinite, indicating that the corresponding algorithm might not achieve expected error rate $\epsilon$
for any $n \in \mathbb{N}$. Both the passive and active label complexities are defined as a number of
labels sufficient to guarantee the expected error rate is at most $\epsilon$. It is also common in the
literature to discuss the number of label requests sufficient to guarantee the error rate is at most
$\epsilon$ with high probability $1 - \delta$ (e.g., Balcan, Hanneke, and Vaughan, 2010). In the
present work, we formulate our results in terms of the expected error rate because it simplifies the
discussion of asymptotics, in that we need only study the behavior of the label complexity as the
single argument $\epsilon$ approaches 0, rather than the more complicated behavior of a function of
$\epsilon$ and $\delta$ as both $\epsilon$ and $\delta$ approach 0 at various relative rates.
However, we note that analogous results for these high-probability guarantees on the error rate can
be extracted from the proofs below without much difficulty, and in several places we explicitly state
results of this form.

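Below we employ the standard notation from asymptotic analysis, including $O(\cdot)$, $o(\cdot)$,
$\Omega(\cdot)$, $\omega(\cdot)$, $\Theta(\cdot)$, $\ll$, and $\gg$. In all contexts below not
otherwise specified, the asymptotics are always considered as $\epsilon \to 0$ when considering a
function of $\epsilon$, and as $n \to \infty$ when considering a function of $n$; also, in any
expression of the form "$x \to 0$," we always mean the limit from above (i.e., $x \downarrow 0$).
For instance, when considering nonnegative functions of $\epsilon$, $\lambda_a(\epsilon)$ and
$\lambda_p(\epsilon)$, the above notations are defined as follows. We say
$\lambda_a(\epsilon) = o(\lambda_p(\epsilon))$ when
$\lim_{\epsilon \to 0} \frac{\lambda_a(\epsilon)}{\lambda_p(\epsilon)} = 0$, and this is equivalent to
writing $\lambda_p(\epsilon) = \omega(\lambda_a(\epsilon))$, $\lambda_a(\epsilon) \ll \lambda_p(\epsilon)$,
or $\lambda_p(\epsilon) \gg \lambda_a(\epsilon)$. We say $\lambda_a(\epsilon) = O(\lambda_p(\epsilon))$
when $\limsup_{\epsilon \to 0} \frac{\lambda_a(\epsilon)}{\lambda_p(\epsilon)} < \infty$, which can be
equivalently expressed as $\lambda_p(\epsilon) = \Omega(\lambda_a(\epsilon))$. Finally, we write
$\lambda_a(\epsilon) = \Theta(\lambda_p(\epsilon))$ to mean that both
$\lambda_a(\epsilon) = O(\lambda_p(\epsilon))$ and $\lambda_a(\epsilon) = \Omega(\lambda_p(\epsilon))$
are satisfied.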

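Define the class of functions $\mathrm{Polylog}(1/\epsilon)$ as those $g : (0,1) \to [0,\infty)$ such
that, for some $k \in [0,\infty)$, $g(\epsilon) = O(\log^k(1/\epsilon))$. For a label complexity
$\Lambda$, also define the set $\mathrm{Nontrivial}(\Lambda)$ as the collection of all pairs
$(f, \mathcal{P})$ of a classifier and a distribution such that, $\forall \epsilon > 0$,
$\Lambda(\epsilon, f, \mathcal{P}) < \infty$, and $\forall g \in \mathrm{Polylog}(1/\epsilon)$,
$\Lambda(\epsilon, f, \mathcal{P}) = \omega(g(\epsilon))$.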

In this context, an active meta-algorithm is a procedure $\mathcal{A}_a$ taking as input a passive
algorithm $\mathcal{A}_p$ and a label budget $n$, such that for any passive algorithm $\mathcal{A}_p$,
$\mathcal{A}_a(\mathcal{A}_p, \cdot)$ is an active learning algorithm. We define an activizer for a
given passive algorithm as follows.

Definition 3 We say an active meta-algorithm $\mathcal{A}_a$ activizes a passive algorithm
$\mathcal{A}_p$ for a concept space $\mathbb{C}$ if the following holds. For any label complexity
$\Lambda_p$ achieved by $\mathcal{A}_p$, the active learning algorithm
$\mathcal{A}_a(\mathcal{A}_p, \cdot)$ achieves a label complexity $\Lambda_a$ such that, for every
$f \in \mathbb{C}$ and every distribution $\mathcal{P}$ on $\mathcal{X}$ with
$(f, \mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$, there exists a constant $c \in [1,\infty)$ such
that

$$\Lambda_a(c\epsilon, f, \mathcal{P}) = o(\Lambda_p(\epsilon, f, \mathcal{P})).$$

In this case, $\mathcal{A}_a$ is called an activizer for $\mathcal{A}_p$ with respect to $\mathbb{C}$,
and the active learning algorithm $\mathcal{A}_a(\mathcal{A}_p, \cdot)$ is called the
$\mathcal{A}_a$-activized $\mathcal{A}_p$. $\diamond$

We also refer to any active meta-algorithm $\mathcal{A}_a$ that activizes every passive algorithm
$\mathcal{A}_p$ for $\mathbb{C}$ as a universal activizer for $\mathbb{C}$. One of the main
contributions of this work is establishing that such universal activizers do exist for any VC class
$\mathbb{C}$.

A bit of explanation is in order regarding Definition 3. We might interpret it as follows: an
activizer for $\mathcal{A}_p$ strongly improves (in a little-o sense) the label complexity for all
nontrivial target functions and distributions. Here, we seek a meta-algorithm that, when given
$\mathcal{A}_p$ as input, results in an active learning algorithm with strictly superior label
complexities. However, there is a sense in which some distributions $\mathcal{P}$ or target functions
$f$ are trivial relative to $\mathcal{A}_p$. For instance, perhaps $\mathcal{A}_p$ has a default
classifier that it is naturally biased toward (e.g., with minimal $\mathcal{P}(x : h(x) = +1)$, as
in the Closure algorithm (Auer and Ortner, 2004)), so that when this default classifier is the target
function, $\mathcal{A}_p$ achieves a constant label complexity. In these trivial scenarios, we cannot
hope to improve over the behavior of the passive algorithm, but instead can only hope to compete with
it. The sense in which we wish to compete may be a subject of some controversy, but the implication
of Definition 3 is that the label complexity of the activized algorithm should be strictly better than
every nontrivial upper bound on the label complexity of the passive algorithm. For instance, if
$\Lambda_p(\epsilon, f, \mathcal{P}) \in \mathrm{Polylog}(1/\epsilon)$, then we are guaranteed
$\Lambda_a(\epsilon, f, \mathcal{P}) \in \mathrm{Polylog}(1/\epsilon)$ as well, but if
$\Lambda_p(\epsilon, f, \mathcal{P}) = O(1)$, we are still only guaranteed
$\Lambda_a(\epsilon, f, \mathcal{P}) \in \mathrm{Polylog}(1/\epsilon)$. This serves
the purpose of defining a framework that can be studied without requiring too much obsession
over small additive terms in trivial scenarios, thus focusing the analyst's efforts toward nontrivial
scenarios where $\mathcal{A}_p$ has relatively large label complexity, which are precisely the
scenarios for which active learning is truly needed. In our proofs, we find that in fact
$\mathrm{Polylog}(1/\epsilon)$ can be replaced with $\log(1/\epsilon)$, giving a slightly broader
definition of "nontrivial," for which all of the results below still hold. Section 7 discusses open
problems regarding this issue of trivial problems.

The definition of $\mathrm{Nontrivial}(\cdot)$ also only requires the activized algorithm to be
effective in scenarios where the passive learning algorithm has reasonable behavior (i.e., finite label
complexities); this is only intended to keep with the reduction-based style of the framework, and in
fact this restriction can easily be lifted using a trick from Balcan, Hanneke, and Vaughan (2010)
(aggregating the activized algorithm with another algorithm that is always reasonable).

Finally, we also allow a constant factor $c$ loss in the $\epsilon$ argument to $\Lambda_a$. We allow
this to be an arbitrary constant, again in the interest of allowing the analyst to focus only on the
most significant aspects of the problem; for most reasonable passive learning algorithms, we typically
expect $\Lambda_p(\epsilon, f, \mathcal{P}) = \mathrm{Poly}(1/\epsilon)$, in which case $c$ can be set
to 1 by adjusting the leading constant factors of $\Lambda_a$. A careful inspection of our proofs
reveals that $c$ can always be set arbitrarily close to 1 without affecting the theorems below (and in
fact, we can even get $c = (1 + o(1))$, a function of $\epsilon$).

Throughout this work, we will adopt the usual notation for probabilities, such as
$\mathbb{P}(\mathrm{er}(h) > \epsilon)$, and as usual we interpret this as measuring the corresponding
event in the (implicit) underlying probability space. In particular, we make the usual implicit
assumption that all sets involved in the analysis are measurable; where this assumption does not hold,
we may turn to outer probabilities, though we will not make further mention of these technical details.
We will also use the notation $\mathcal{P}^k(\cdot)$ to represent $k$-dimensional product measures; for
instance, for a measurable set $A \subseteq \mathcal{X}^k$,
$\mathcal{P}^k(A) = \mathbb{P}((X'_1, \ldots, X'_k) \in A)$, for independent $\mathcal{P}$-distributed
random variables $X'_1, \ldots, X'_k$. Additionally, to simplify notation, we will adopt the convention
that $\mathcal{X}^0 = \{\varnothing\}$, and $\mathcal{P}^0(\mathcal{X}^0) = 1$. Throughout, we will
denote by $\mathbb{1}_A(z)$ the indicator function for a set $A$, which has the value 1 when
$z \in A$ and 0 otherwise; additionally, at times it will be more convenient to use the bipolar
indicator function, defined as $\mathbb{1}^{\pm}_A(z) = 2\mathbb{1}_A(z) - 1$.

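We will require a few additional definitions for the discussion below. For any classifier
$h : \mathcal{X} \to \{-1,+1\}$ and finite sequence of labeled examples
$\mathcal{L} \in \bigcup_m (\mathcal{X} \times \{-1,+1\})^m$, define the empirical error rate
$\mathrm{er}_{\mathcal{L}}(h) = |\mathcal{L}|^{-1} \sum_{(x,y) \in \mathcal{L}} \mathbb{1}_{\{-y\}}(h(x))$;
for completeness, define $\mathrm{er}_{\varnothing}(h) = 0$. Also, for $\mathcal{L} = \mathcal{Z}_m$,
the first $m$ labeled examples in the data sequence, abbreviate this as
$\mathrm{er}_m(h) = \mathrm{er}_{\mathcal{Z}_m}(h)$. For any distribution $P$ on $\mathcal{X}$, set of
classifiers $\mathcal{H}$, classifier $h$, and $r > 0$, define
$\mathrm{B}_{\mathcal{H},P}(h, r) = \{g \in \mathcal{H} : P(x : h(x) \neq g(x)) \le r\}$; when
$P = \mathcal{P}$, the distribution of the unlabeled examples, and $\mathcal{P}$ is clear from the
context, we abbreviate this as $\mathrm{B}_{\mathcal{H}}(h, r) = \mathrm{B}_{\mathcal{H},\mathcal{P}}(h, r)$;
furthermore, when $P = \mathcal{P}$ and $\mathcal{H} = \mathbb{C}$, the concept space, and both
$\mathcal{P}$ and $\mathbb{C}$ are clear from the context, we abbreviate this as
$\mathrm{B}(h, r) = \mathrm{B}_{\mathbb{C},\mathcal{P}}(h, r)$. Also, for any set of classifiers
$\mathcal{H}$, and any sequence of labeled examples
$\mathcal{L} \in \bigcup_m (\mathcal{X} \times \{-1,+1\})^m$, define
$\mathcal{H}[\mathcal{L}] = \{h \in \mathcal{H} : \mathrm{er}_{\mathcal{L}}(h) = 0\}$; for any
$(x, y) \in \mathcal{X} \times \{-1,+1\}$, abbreviate
$\mathcal{H}[(x,y)] = \mathcal{H}[\{(x,y)\}] = \{h \in \mathcal{H} : h(x) = y\}$.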

We also adopt the usual definition of "shattering" used in learning theory (e.g., Vapnik, 1998).
Specifically, for any set of classifiers $\mathcal{H}$, $k \in \mathbb{N}$, and
$S = (x_1, \ldots, x_k) \in \mathcal{X}^k$, we say $\mathcal{H}$ shatters $S$ if,
$\forall (y_1, \ldots, y_k) \in \{-1,+1\}^k$, $\exists h \in \mathcal{H}$ such that
$\forall i \in \{1, \ldots, k\}$, $h(x_i) = y_i$; equivalently, $\mathcal{H}$ shatters $S$ if
$\exists \{h_1, \ldots, h_{2^k}\} \subseteq \mathcal{H}$ such that for each
$i, j \in \{1, \ldots, 2^k\}$ with $i \neq j$, $\exists \ell \in \{1, \ldots, k\}$ with
$h_i(x_\ell) \neq h_j(x_\ell)$. To simplify notation, we will also say that $\mathcal{H}$ shatters
$\varnothing$ if and only if $\mathcal{H} \neq \{\}$. As usual, we define the VC dimension of
$\mathbb{C}$, denoted $d$, as the largest integer $k$ such that $\exists S \in \mathcal{X}^k$ shattered
by $\mathbb{C}$ (Vapnik, 1998). To focus on nontrivial problems, we will only consider concept spaces
$\mathbb{C}$ with $d > 0$ in the results below. Generally, any such concept space $\mathbb{C}$ with
$d < \infty$ is called a VC class.
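
For a finite collection of classifiers, the definition of shattering above can be checked directly by
counting the distinct label patterns realized on the points. The following Python sketch is
illustrative only; the function name `shatters` and the helper factories `threshold` and `interval`
are hypothetical names introduced here, and real-valued instances are assumed.

```python
from typing import Callable, Sequence

Classifier = Callable[[float], int]


def threshold(z: float) -> Classifier:
    """h_z(x) = +1 if x >= z, else -1 (a threshold classifier)."""
    return lambda x: 1 if x >= z else -1


def interval(a: float, b: float) -> Classifier:
    """h_[a,b](x) = +1 if a <= x <= b, else -1 (an interval classifier)."""
    return lambda x: 1 if a <= x <= b else -1


def shatters(classifiers: Sequence[Classifier], points: Sequence[float]) -> bool:
    """True iff every one of the 2^k labelings of the k points is realized.

    With k = 0 there is a single (empty) labeling, so a nonempty collection
    shatters the empty sequence and an empty collection does not, matching
    the convention stated in the text.
    """
    patterns = {tuple(h(x) for x in points) for h in classifiers}
    return len(patterns) == 2 ** len(points)
```

For example, a few thresholds shatter a single point but never two points (no threshold labels the
smaller point $+1$ and the larger point $-1$), whereas intervals can shatter two points.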



2.1 Motivating Examples 

Throughout this paper, we will repeatedly refer to a few canonical examples. Although themselves
quite toy-like, they represent the boiled-down essence of some important distinctions between various
types of learning problems. In some sense, the process of grappling with the fundamental
distinctions raised by these types of examples has been a driving force behind much of the recent
progress in understanding the label complexity of active learning.

The first example is perhaps the most classic, and is clearly the first that comes to mind when 
considering the potential for active learning to provide strong improvements over passive learning. 

Example 1 In the problem of learning threshold classifiers, we consider $\mathcal{X} = [0, 1]$ and

$$\mathbb{C} = \{h_z(x) = \mathbb{1}^{\pm}_{[z,1]}(x) : z \in (0,1)\}. \quad \diamond$$

There is a simple universal activizer for threshold classifiers, based on a kind of binary search.
Specifically, suppose $n \in \mathbb{N}$ and that $\mathcal{A}_p$ is any given passive learning
algorithm. Consider the points in $\{X_1, X_2, \ldots, X_m\}$, for $m = 2^{n-1}$, and sort them in
increasing order: $X_{(1)}, X_{(2)}, \ldots, X_{(m)}$. Also initialize $\ell = 0$ and $u = m + 1$, and
define $X_{(0)} = 0$ and $X_{(m+1)} = 1$. Now request the label of $X_{(i)}$ for
$i = \lfloor (\ell + u)/2 \rfloor$ (i.e., the median point between $\ell$ and $u$); if the label is
$-1$, let $\ell = i$, and otherwise let $u = i$; repeat this (requesting this median point, then
updating $\ell$ or $u$ accordingly) until we have $u = \ell + 1$. Finally, let $\hat{z} = X_{(u)}$,
construct the labeled sequence
$\hat{\mathcal{L}} = \{(X_1, h_{\hat{z}}(X_1)), \ldots, (X_m, h_{\hat{z}}(X_m))\}$, and return the
classifier $\hat{h} = \mathcal{A}_p(\hat{\mathcal{L}})$.

Since each label request at least halves the set of integers between $\ell$ and $u$, the total number
of label requests is at most $\log_2(m) + 1 = n$. Supposing $f \in \mathbb{C}$ is the target function,
this procedure maintains the invariant that $f(X_{(\ell)}) = -1$ and $f(X_{(u)}) = +1$. Thus, once we
reach $u = \ell + 1$, since $f$ is a threshold, it must be some $h_z$ with
$z \in (X_{(\ell)}, X_{(u)}]$; therefore every $X_{(j)}$ with $j \le \ell$ has $f(X_{(j)}) = -1$, and
likewise every $X_{(j)}$ with $j \ge u$ has $f(X_{(j)}) = +1$; in particular, this means
$\hat{\mathcal{L}}$ equals $\mathcal{Z}_m$, the true labeled sequence. But this means
$\hat{h} = \mathcal{A}_p(\mathcal{Z}_m)$. Since $n = \log_2(m) + 1$, this active learning algorithm
will achieve an equivalent error rate to what $\mathcal{A}_p$ achieves with $m$ labeled examples, but
using only $\log_2(m) + 1$ label requests. In particular, this implies that if $\mathcal{A}_p$ achieves
label complexity $\Lambda_p$, then this active learning algorithm achieves label complexity
$\Lambda_a$ such that $\Lambda_a(\epsilon, f, \mathcal{P}) \le \log_2 \Lambda_p(\epsilon, f, \mathcal{P}) + 2$;
as long as $1 \ll \Lambda_p(\epsilon, f, \mathcal{P}) < \infty$, this is
$o(\Lambda_p(\epsilon, f, \mathcal{P}))$, so that this procedure activizes $\mathcal{A}_p$ for
$\mathbb{C}$.
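
The binary-search procedure for thresholds can be written out as a short program. The sketch below is
an illustration under assumptions, not the paper's own code: the names `activized_threshold_learner`,
`passive`, and `request_label` are hypothetical, the unlabeled pool is a finite Python list containing
at least $2^{n-1}$ points, and thresholds are taken in the form $h_z(x) = +1$ iff $x \ge z$.

```python
from typing import Callable, List, Tuple

Classifier = Callable[[float], int]
Labeled = List[Tuple[float, int]]


def threshold(z: float) -> Classifier:
    """h_z(x) = +1 if x >= z, else -1."""
    return lambda x: 1 if x >= z else -1


def activized_threshold_learner(
    passive: Callable[[Labeled], object],    # any passive learner A_p
    xs: List[float],                         # unlabeled pool X_1, X_2, ...
    request_label: Callable[[float], int],   # returns f(x); each call costs one label
    n: int,                                  # label budget
):
    """Binary search for the decision boundary, then feed inferred labels to A_p."""
    m = 2 ** (n - 1)
    pool = xs[:m]
    pts = sorted(pool)                       # pts[j - 1] plays the role of X_(j)
    lo, hi = 0, m + 1                        # sentinels: X_(0) = 0, X_(m+1) = 1
    used = 0
    while hi > lo + 1:
        i = (lo + hi) // 2                   # median index between lo and hi
        used += 1
        if request_label(pts[i - 1]) == -1:
            lo = i                           # invariant: f(X_(lo)) = -1
        else:
            hi = i                           # invariant: f(X_(hi)) = +1
    z_hat = pts[hi - 1] if hi <= m else 1.0  # smallest point known to be positive
    h_hat = threshold(z_hat)
    inferred = [(x, h_hat(x)) for x in pool] # labels for all m points
    return passive(inferred), used
```

Passing an identity function as `passive` makes it easy to inspect the inferred labeled sequence; in
the realizable case it should coincide with the true labels of the first $m$ points while using at
most $n$ label requests.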

The second example we consider is almost equally simple (only increasing the VC dimension 
from 1 to 2), but is far more subtle in terms of how we must approach its analysis in active learning. 



Example 2 In the problem of learning interval classifiers, we consider $\mathcal{X} = [0, 1]$ and

$$\mathbb{C} = \{h_{[a,b]}(x) = \mathbb{1}^{\pm}_{[a,b]}(x) : 0 < a \le b < 1\}. \quad \diamond$$



For the intervals problem, we can also construct a universal activizer, though slightly more
complicated. Specifically, suppose again that $n \in \mathbb{N}$ and that $\mathcal{A}_p$ is any given
passive learning algorithm. We first request the labels $\{Y_1, \ldots, Y_{\lfloor n/2 \rfloor}\}$ of
the first $\lfloor n/2 \rfloor$ examples in the sequence. If every one of these labels is $-1$, then we
immediately return the all-negative constant classifier $\hat{h}(x) = -1$. Otherwise, consider the
points $\{X_1, X_2, \ldots, X_m\}$, for $m = \max\{2^{\lfloor n/4 \rfloor - 1}, n\}$, and sort them in
increasing order $X_{(1)}, X_{(2)}, \ldots, X_{(m)}$. For some value
$i \in \{1, \ldots, \lfloor n/2 \rfloor\}$ with $Y_i = +1$, let $j_+$ denote the corresponding index
$j$ such that $X_{(j)} = X_i$. Also initialize $\ell_1 = 0$, $u_1 = \ell_2 = j_+$, and $u_2 = m + 1$,
and define $X_{(0)} = 0$ and $X_{(m+1)} = 1$. Now if $\ell_1 + 1 < u_1$, request the label of
$X_{(i)}$ for $i = \lfloor (\ell_1 + u_1)/2 \rfloor$ (i.e., the median point between $\ell_1$ and
$u_1$); if the label is $-1$, let $\ell_1 = i$, and otherwise let $u_1 = i$; repeat this (requesting
this median point, then updating $\ell_1$ or $u_1$ accordingly) until we have $u_1 = \ell_1 + 1$. Now
if $\ell_2 + 1 < u_2$, request the label of $X_{(i)}$ for $i = \lfloor (\ell_2 + u_2)/2 \rfloor$
(i.e., the median point between $\ell_2$ and $u_2$); if the label is $-1$, let $u_2 = i$, and
otherwise let $\ell_2 = i$; repeat this (requesting this median point, then updating $u_2$ or $\ell_2$
accordingly) until we have $u_2 = \ell_2 + 1$. Finally, let $\hat{a} = X_{(u_1)}$ and
$\hat{b} = X_{(\ell_2)}$, construct the labeled sequence
$\hat{\mathcal{L}} = \{(X_1, h_{[\hat{a},\hat{b}]}(X_1)), \ldots, (X_m, h_{[\hat{a},\hat{b}]}(X_m))\}$,
and return the classifier $\hat{h} = \mathcal{A}_p(\hat{\mathcal{L}})$.

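Since each label request in the second phase halves the set of values between either $\ell_1$ and
$u_1$ or $\ell_2$ and $u_2$, the total number of label requests is at most
$\min\{m, \lfloor n/2 \rfloor + 2\log_2(m) + 2\} \le n$. Suppose $f \in \mathbb{C}$ is the target
function, and let $w(f) = \mathcal{P}(x : f(x) = +1)$. If $w(f) = 0$, then with probability 1 the
algorithm will return the constant classifier $\hat{h}(x) = -1$, which has $\mathrm{er}(\hat{h}) = 0$
in this case. Otherwise, if $w(f) > 0$, then for any $n \ge \frac{4}{w(f)} \ln \frac{1}{\epsilon}$,
with probability at least $1 - \epsilon$, there exists $i \in \{1, \ldots, \lfloor n/2 \rfloor\}$ with
$Y_i = +1$. Let $H_+$ denote the event that such an $i$ exists. Supposing this is the case, the
algorithm will make it into the second phase. In this case, the procedure maintains the invariant that
$f(X_{(\ell_1)}) = -1$, $f(X_{(u_1)}) = f(X_{(\ell_2)}) = +1$, and $f(X_{(u_2)}) = -1$, where
$\ell_1 < u_1 \le \ell_2 < u_2$. Thus, once we have $u_1 = \ell_1 + 1$ and $u_2 = \ell_2 + 1$, since
$f$ is an interval, it must be some $h_{[a,b]}$ with $a \in (X_{(\ell_1)}, X_{(u_1)}]$ and
$b \in [X_{(\ell_2)}, X_{(u_2)})$; therefore every $X_{(j)}$ with $j \le \ell_1$ or $j \ge u_2$ has
$f(X_{(j)}) = -1$, and likewise every $X_{(j)}$ with $u_1 \le j \le \ell_2$ has $f(X_{(j)}) = +1$; in
particular, this means $\hat{\mathcal{L}}$ equals $\mathcal{Z}_m$, the true labeled sequence. But this
means $\hat{h} = \mathcal{A}_p(\mathcal{Z}_m)$. Supposing $\mathcal{A}_p$ achieves label complexity
$\Lambda_p$, and that
$n \ge \max\{8 + 4\log_2 \Lambda_p(\epsilon, f, \mathcal{P}), \frac{4}{w(f)} \ln \frac{1}{\epsilon}\}$,
then $m \ge 2^{\lfloor n/4 \rfloor - 1} \ge \Lambda_p(\epsilon, f, \mathcal{P})$ and
$\mathbb{E}[\mathrm{er}(\hat{h})] \le \mathbb{E}[\mathrm{er}(\hat{h}) \mathbb{1}_{H_+}] + (1 - \mathbb{P}(H_+)) \le \mathbb{E}[\mathrm{er}(\mathcal{A}_p(\mathcal{Z}_m))] + \epsilon \le 2\epsilon$.
In particular, this means this active learning algorithm achieves label complexity $\Lambda_a$
such that, for any $f \in \mathbb{C}$ with $w(f) = 0$, $\Lambda_a(2\epsilon, f, \mathcal{P}) = 0$, and
for any $f \in \mathbb{C}$ with $w(f) > 0$,
$\Lambda_a(2\epsilon, f, \mathcal{P}) \le \max\{8 + 4\log_2 \Lambda_p(\epsilon, f, \mathcal{P}), \frac{4}{w(f)} \ln \frac{1}{\epsilon}\}$.
If $(f, \mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$, then
$\frac{4}{w(f)} \ln \frac{1}{\epsilon} = o(\Lambda_p(\epsilon, f, \mathcal{P}))$ and
$8 + 4\log_2 \Lambda_p(\epsilon, f, \mathcal{P}) = o(\Lambda_p(\epsilon, f, \mathcal{P}))$, so that
$\Lambda_a(2\epsilon, f, \mathcal{P}) = o(\Lambda_p(\epsilon, f, \mathcal{P}))$. Therefore, this
procedure activizes $\mathcal{A}_p$ for $\mathbb{C}$.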
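
The two-phase procedure for intervals can be sketched in the same style. This is an illustration
under assumptions rather than the paper's own code: the names `activized_interval_learner`, `passive`,
and `request_label` are hypothetical, the pool is a finite list of distinct points of length at least
$m$, and labels are cached so that a point already labeled in the first phase is not paid for twice
(which is what keeps the count at most $m$ when $m = n$).

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[float], int]
Labeled = List[Tuple[float, int]]


def interval(a: float, b: float) -> Classifier:
    """h_[a,b](x) = +1 if a <= x <= b, else -1."""
    return lambda x: 1 if a <= x <= b else -1


def activized_interval_learner(
    passive: Callable[[Labeled], object],    # any passive learner A_p
    xs: List[float],                         # unlabeled pool of distinct points
    request_label: Callable[[float], int],   # returns f(x); each call costs one label
    n: int,                                  # label budget
):
    cache: Dict[float, int] = {}             # labels already paid for

    def label(x: float) -> int:
        if x not in cache:
            cache[x] = request_label(x)
        return cache[x]

    # Phase 1: label the first floor(n/2) points, looking for a positive example.
    positives = [x for x in xs[: n // 2] if label(x) == 1]
    if not positives:
        return (lambda x: -1), len(cache)    # default: all-negative classifier

    # Phase 2: two binary searches around the positive example found above.
    m = max(2 ** max(n // 4 - 1, 0), n)
    pool = xs[:m]
    pts = sorted(pool)                       # pts[j - 1] plays the role of X_(j)
    j_plus = pts.index(positives[0]) + 1
    lo1, hi1 = 0, j_plus                     # search for the left end-point
    while hi1 > lo1 + 1:
        i = (lo1 + hi1) // 2
        if label(pts[i - 1]) == -1:
            lo1 = i
        else:
            hi1 = i
    lo2, hi2 = j_plus, m + 1                 # search for the right end-point
    while hi2 > lo2 + 1:
        i = (lo2 + hi2) // 2
        if label(pts[i - 1]) == -1:
            hi2 = i
        else:
            lo2 = i
    h_hat = interval(pts[hi1 - 1], pts[lo2 - 1])
    inferred = [(x, h_hat(x)) for x in pool]
    return passive(inferred), len(cache)
```

Note the target-dependent behavior visible even in this sketch: when the first phase sees no positive
label, the function returns the default classifier without any evidence beyond those
$\lfloor n/2 \rfloor$ negative labels.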

This example also brings to light some interesting phenomena in the analysis of the label
complexity of active learning. Note that unlike the thresholds example, we have a much stronger
dependence on the target function in these label complexity bounds, via the $w(f)$ quantity. This
issue is fundamental to the problem, and cannot be avoided. In particular, when $\mathcal{P}([0,x])$
is continuous, this is the very issue that makes the minimax label complexity for this problem (i.e.,
$\min_{\Lambda_a} \max_{f \in \mathbb{C}} \Lambda_a(\epsilon, f, \mathcal{P})$) no better than passive
learning (Dasgupta, 2005). Thus, this problem emphasizes the need for any informative label complexity
analyses of active learning to explicitly describe the dependence of the label complexity on the target
function, as advocated by Dasgupta (2005). This example also highlights the unverifiability phenomenon
explored by Balcan, Hanneke, and Vaughan (2010), since in the case of $w(f) = 0$, the error rate of the
returned classifier is zero, but (for nondegenerate $\mathcal{P}$) there is no way for the algorithm to
verify this fact based only on the finite number of labels it observes. In fact, Balcan, Hanneke, and
Vaughan (2010) have shown that under continuous $\mathcal{P}$, for any $f \in \mathbb{C}$ with
$w(f) = 0$, the number of labels required to both find a classifier of small error rate and verify that
the error rate is small based only on observable quantities is essentially no better than for passive
learning.

These issues are present to a small degree in the intervals example, but were easily handled
in a very natural way. The target-dependence shows up only in an initial phase of waiting for a
positive example, and the always-negative classifiers were handled by setting a default return value.
However, we can amplify these issues so that they show up in more subtle and involved ways.
Specifically, consider the following example, studied by Balcan, Hanneke, and Vaughan (2010).



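Example 3 In the problem of learning unions of $i$ intervals, we consider $\mathcal{X} = [0,1]$ and

$$\mathbb{C} = \left\{ h_{\mathbf{z}}(x) = \mathbb{1}^{\pm}_{\bigcup_{j=1}^{i} [z_{2j-1}, z_{2j}]}(x) : 0 < z_1 \le z_2 \le \ldots \le z_{2i} < 1 \right\}. \quad \diamond$$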

The challenge of this problem is that, because sometimes $z_j = z_{j+1}$ for some $j$ values, we
do not know how many intervals are required to minimally represent the target function: only that
it is at most $i$. This issue will be made clearer below. We can essentially think of any effective
strategy here as having two components: one component that searches (perhaps randomly) with the 
purpose of identifying at least one example from each decision region, and another component that 
refines our estimates of the end-points of the regions the first component identifies. Later, we will 
go through the behavior of a universal activizer for this problem in detail. 

3. Disagreement-Based Active Learning 

At present, perhaps the best-understood active learning algorithms are those choosing their label 
requests based on disagreement among a set of remaining candidate classifiers. The canonical 
algorithm of this type, a version of which we discuss below in Section 5.1, was proposed by Cohn, Atlas, 
and Ladner (1994). Specifically, for any set H of classifiers, define the region of disagreement: 

DIS(H) = {x ∈ 𝒳 : ∃ h_1, h_2 ∈ H s.t. h_1(x) ≠ h_2(x)}. 
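
As a concrete illustration of the region of disagreement, the following minimal sketch computes DIS(H) restricted to a finite pool of points, for a finite set H of classifiers. The names `region_of_disagreement` and `threshold` are hypothetical helpers introduced only for this illustration, and restricting to a finite pool is an assumption made so that the set can be enumerated.

```python
from typing import Callable, Iterable, List, Sequence

Classifier = Callable[[float], int]  # maps a point to a label in {-1, +1}


def region_of_disagreement(hypotheses: Sequence[Classifier],
                           pool: Iterable[float]) -> List[float]:
    """Return the pool points on which at least two hypotheses disagree."""
    return [x for x in pool if len({h(x) for h in hypotheses}) > 1]


def threshold(z: float) -> Classifier:
    """Threshold classifier: +1 on [z, infinity), -1 elsewhere."""
    return lambda x: 1 if x >= z else -1


H = [threshold(0.3), threshold(0.5)]
pool = [0.1, 0.3, 0.4, 0.5, 0.9]
print(region_of_disagreement(H, pool))  # prints [0.3, 0.4]
```

Only the points in [0.3, 0.5) are labeled differently by the two thresholds, so these are the only points whose labels would be requested by a disagreement-based method.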

The basic idea of disagreement-based algorithms is that, at any given time in the algorithm, 
there is a subset V ⊆ C of remaining candidates, called the version space, which is guaranteed 
to contain the target f. When deciding whether to request a particular label Y_i, the algorithm 
simply checks whether X_i ∈ DIS(V): if so, the algorithm requests Y_i, and otherwise it does 
not. This general strategy is reasonable, since for any X_i ∉ DIS(V), the label agreed upon by V 
must be f(X_i), so that we would get no information by requesting Y_i; that is, for X_i ∉ DIS(V), 
we can accurately infer Y_i based on information already available. This type of algorithm has 
recently received substantial attention, not only for its obvious elegance and simplicity, but also 
because (as we discuss in a later section) there are natural ways to extend the technique to the general 
problem of learning with label noise and model misspecification (the agnostic setting). The details 
of disagreement-based algorithms can vary in how they update the set V and how frequently they do 
so, but it turns out almost all disagreement-based algorithms share many of the same fundamental 
properties, which we describe below. 

3.1 A Basic Disagreement-Based Active Learning Algorithm 

In Section 5.1, we discuss several known results on the label complexities achievable by these types 
of active learning algorithms. However, for now let us examine a very basic algorithm of this type. 
The following is intended to be a simple representative of the family of disagreement-based active 
learning algorithms. It has been stripped down to the bare essentials of what makes such algorithms 
work. As a result, although the gap between its label complexity and that achieved by passive 
learning is not necessarily as large as those achieved by the more sophisticated disagreement-based 
active learning algorithms of Section 5.1, it has the property that whenever those more sophisticated 
methods have label complexities asymptotically superior to those achieved by passive learning, that 
guarantee will also be true for this simpler method, and vice versa. The algorithm operates in only 
two phases. In the first, it uses one batch of label requests to reduce the version space V to a subset of 
C; in the second, it uses another batch of label requests, this time only requesting labels for points 
in DIS(V). Thus, we have isolated precisely that aspect of disagreement-based active learning that 
involves improvements due to only requesting the labels of examples in the region of disagreement. 
The procedure is formally defined as follows, in terms of an estimator P̂_n(DIS(V)) specified below. 



Meta-Algorithm 0 

Input: passive algorithm A_p, label budget n 
Output: classifier ĥ 

0. Request the first ⌊n/2⌋ labels {Y_1, ..., Y_⌊n/2⌋}, and let t ← ⌊n/2⌋ 

1. Let V = {h ∈ C : er_⌊n/2⌋(h) = 0} 

2. Let Δ̂ ← P̂_n(DIS(V)) 

3. Let ℒ ← {} 

4. For m = ⌊n/2⌋ + 1, ..., ⌊n/2⌋ + ⌊n/(4Δ̂)⌋ 

5. If X_m ∈ DIS(V) and t < n, request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1 

6. Else let ŷ ← h(X_m) for an arbitrary h ∈ V 

7. Let ℒ ← ℒ ∪ {(X_m, ŷ)} 

8. Return A_p(ℒ) 
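
The steps above can be sketched in code for the special case of threshold classifiers on [0, 1] under a uniform distribution. This is only an illustrative sketch under stated assumptions: the function names (`meta_algorithm_0_thresholds`, `erm_threshold`) are hypothetical, the passive learner is a simple stand-in, and the estimate `delta_hat` is a simplified placeholder (an empirical frequency on fresh unlabeled points, floored at 4/n), not the specific confidence bound used in the analysis.

```python
import random
from typing import Callable, List, Tuple

Sample = List[Tuple[float, int]]
Classifier = Callable[[float], int]


def erm_threshold(sample: Sample) -> Classifier:
    """Hypothetical passive learner: threshold at the smallest positive example."""
    z = min((x for x, y in sample if y == 1), default=1.0)
    return lambda x: 1 if x >= z else -1


def meta_algorithm_0_thresholds(passive: Callable[[Sample], Classifier],
                                target_z: float, n: int, rng: random.Random):
    label = lambda x: 1 if x >= target_z else -1  # noise-free label oracle

    # Step 0: request the first floor(n/2) labels.
    first = [rng.random() for _ in range(n // 2)]
    t = len(first)

    # Step 1: for thresholds, V = {h_z' : a < z' <= b}, so DIS(V) = (a, b).
    a = max((x for x in first if label(x) == -1), default=0.0)
    b = min((x for x in first if label(x) == 1), default=1.0)
    in_dis = lambda x: a < x < b

    # Step 2: placeholder estimate of the mass of DIS(V) from unlabeled points
    # (an assumption of this sketch; the paper uses its own estimator).
    unlabeled = [rng.random() for _ in range(n * n)]
    delta_hat = max(4.0 / n, sum(in_dis(x) for x in unlabeled) / len(unlabeled))

    # Steps 3-7: build the labeled sample, requesting labels only inside DIS(V).
    sample: Sample = []
    for _ in range(int(n / (4 * delta_hat))):
        x = rng.random()
        if in_dis(x) and t < n:
            y = label(x)               # Step 5: request the label
            t += 1
        else:
            y = 1 if x >= b else -1    # Step 6: label given by h_b, which is in V
        sample.append((x, y))

    # Step 8: hand the sample to the passive algorithm.
    return passive(sample), sample, t


h, sample, used = meta_algorithm_0_thresholds(erm_threshold, 0.5, 100, random.Random(0))
print(len(sample), used)
```

The printed pair compares the size of the sample given to the passive learner with the number of labels actually requested; for thresholds the former is typically much larger than the budget n.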

Meta-Algorithm 0 depends on a data-dependent estimator P̂_n(DIS(V)) of 𝒫(DIS(V)), which 
we can define in a variety of ways using only unlabeled examples. In particular, for the theorems 
below, we will take the following definition for P̂_n(DIS(V)), designed to be a confidence upper 
bound on 𝒫(DIS(V)). Let U_n = {X_{n²+1}, ..., X_{2n²}}. Then define 



Meta-Algorithm 0 is divided into two stages: one stage where we focus on reducing V, and 
a second stage where we construct the sample ℒ for the passive algorithm. This might intuitively 
seem somewhat wasteful, as one might wish to use the requested labels from the first stage to 
augment those in the second stage when constructing ℒ, thus feeding all of the observed labels 
into the passive algorithm A_p. Indeed, this can improve the label complexity in some cases (albeit 
only by a constant factor); however, in order to get the general property of being an activizer for 
all passive algorithms A_p, we construct the sample ℒ so that the conditional distribution of the X 
components in ℒ given |ℒ| is 𝒫^{|ℒ|}, so that it is (conditionally) an i.i.d. sample, which is essential 
to our analysis. The choice of the number of (unlabeled) examples to process in the second stage 
guarantees (by a Chernoff bound) that the "t < n" constraint in Step 5 is redundant; this is a trick 
we will employ in several of the methods below. As explained above, because f ∈ V, this implies 
that every (x, y) ∈ ℒ has y = f(x). 

To give some basic intuition for how this algorithm behaves, consider the example of learning 
threshold classifiers (Example 1); to simplify the explanation, for now we ignore the fact that P̂_n 
is only an estimate, as well as the "t < n" constraint in Step 5 (both of which will be addressed 
in the general analysis below). In this case, suppose the target function is f = h_z. Let a = 
max{X_i : X_i < z, 1 ≤ i ≤ ⌊n/2⌋} and b = min{X_i : X_i ≥ z, 1 ≤ i ≤ ⌊n/2⌋}. Then 
V = {h_{z'} : a < z' ≤ b} and DIS(V) = (a, b), so that the second phase of the algorithm only 
requests labels for a number of points in the region (a, b). With probability 1 − ε, the probability 
mass in this region is at most O(log(1/ε)/n), so that |ℒ| ≥ ℓ_{n,ε} = Ω(n² / log(1/ε)); also, since the 
labels in ℒ are all correct, and the X_m values in ℒ are conditionally i.i.d. (with distribution 𝒫) given 
|ℒ|, we see that the conditional distribution of ℒ given |ℒ| = ℓ is the same as the (unconditional) 
distribution of Z_ℓ. In particular, if A_p achieves label complexity Λ_p, and ĥ_n is the classifier returned 
by Meta-Algorithm 0 applied to A_p, then for any n = Ω(√(Λ_p(ε, f, 𝒫) log(1/ε))) chosen so that 
ℓ_{n,ε} ≥ Λ_p(ε, f, 𝒫), we have 



E[er(ĥ_n)] ≤ ε + sup_{ℓ ≥ ℓ_{n,ε}} E[er(A_p(Z_ℓ))] ≤ 2ε. 

This indicates the active learning algorithm achieves label complexity Λ_a with Λ_a(2ε, f, 𝒫) = 
O(√(Λ_p(ε, f, 𝒫) log(1/ε))). In particular, if ∞ > Λ_p(ε, f, 𝒫) = ω(log(1/ε)), then Λ_a(2ε, f, 𝒫) = 
o(Λ_p(ε, f, 𝒫)). Therefore, Meta-Algorithm 0 is a universal activizer for the space of threshold 
classifiers. 

In contrast, consider the problem of learning interval classifiers (Example 2). In this case, 
suppose the target function f has 𝒫(x : f(x) = +1) = 0, and that 𝒫 is uniform in [0, 1]. Since 
(with probability one) every Y_i = −1, we have V = {h_[a,b] : {X_1, ..., X_⌊n/2⌋} ∩ [a, b] = ∅}. 
But this contains classifiers h_[a,a] for every a ∈ (0, 1) \ {X_1, ..., X_⌊n/2⌋}, so that DIS(V) = 
(0, 1) \ {X_1, ..., X_⌊n/2⌋}. Thus, 𝒫(DIS(V)) = 1, and |ℒ| = O(n); that is, A_p gets run with 
no more labeled examples than simple passive learning would use. This indicates we should not 
expect Meta-Algorithm 0 to be a universal activizer for interval classifiers. Below, we formalize 
this, by constructing a passive learning algorithm A_p that Meta-Algorithm 0 does not activize for 
this scenario. 
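
The contrast between the two examples can be checked numerically. The sketch below, under the assumption of a uniform distribution on [0, 1], computes the mass of DIS(V) for a threshold target (the gap b − a around z) and estimates it for an interval target of zero width (every point that is not one of the observed negative examples lies in DIS(V)). The function names are hypothetical and exist only for this illustration.

```python
import random
from typing import List


def dis_mass_threshold(sample_xs: List[float], z: float) -> float:
    """Mass of DIS(V) = (a, b) for thresholds under the uniform distribution."""
    a = max((x for x in sample_xs if x < z), default=0.0)
    b = min((x for x in sample_xs if x >= z), default=1.0)
    return b - a


def dis_mass_empty_interval(sample_xs: List[float], test_xs: List[float]) -> float:
    """Fraction of test points in DIS(V) when all observed labels are negative.

    A point x is in DIS(V) exactly when the zero-width interval [x, x] is
    consistent with the sample, i.e. when x is not one of the sample points.
    """
    seen = set(sample_xs)
    return sum(x not in seen for x in test_xs) / len(test_xs)


rng = random.Random(1)
for n in (100, 1000, 10000):
    xs = [rng.random() for _ in range(n // 2)]
    test = [rng.random() for _ in range(1000)]
    print(n, dis_mass_threshold(xs, 0.5), dis_mass_empty_interval(xs, test))
```

For thresholds the first quantity shrinks roughly like 1/n, while for the zero-width interval target the second stays at 1 regardless of n, which is the behavior described above.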

3.2 The Limiting Region of Disagreement 

In this subsection, we generalize the examples from the previous subsection. Specifically, we prove 
that the performance of Meta-Algorithm 0 is intimately tied to a particular limiting set, referred to 
as the disagreement core. A similar definition was given by Balcan, Hanneke, and Vaughan (2010) 
(there referred to as the boundary, for reasons that will become clear below); it is also related to 
certain quantities in the work of Hanneke (2007b, 2011) described below in Section 5.1. 

Definition 4 Define the disagreement core of a classifier f with respect to a set of classifiers H and 
distribution P as 

∂_{H,P} f = lim_{r→0} DIS(B_{H,P}(f, r)). 

When P = 𝒫, the true distribution on 𝒳, and 𝒫 is clear from the context, we abbreviate this as 
∂_H f = ∂_{H,𝒫} f; if additionally H = C, the full concept space, which is clear from the context, we 
further abbreviate this as ∂f = ∂_C f = ∂_{C,𝒫} f. 

As we will see, disagreement-based algorithms often tend to focus their label requests around 
the disagreement core of the target function. As such, the concept of the disagreement core will be 
essential in much of our discussion below. We therefore go through a few examples to build intuition 
about this concept and its properties. Perhaps the simplest example to start with is C as the class 
of threshold classifiers (Example 1), under 𝒫 uniform on [0, 1]. For any h_z ∈ C and sufficiently 
small r > 0, B(h_z, r) = {h_{z'} : |z' − z| ≤ r}, and DIS(B(h_z, r)) = [z − r, z + r). Therefore, 
∂h_z = lim_{r→0} DIS(B(h_z, r)) = lim_{r→0} [z − r, z + r) = {z}. Thus, in this case, the disagreement core 
of h_z with respect to C and 𝒫 is precisely the decision boundary of the classifier. As a slightly 
more involved example, consider again the example of interval classifiers (Example 2), again under 
𝒫 uniform on [0, 1]. Now for any h_[a,b] ∈ C with b − a > 0, for any sufficiently small r > 0, 
B(h_[a,b], r) = {h_[a',b'] : |a − a'| + |b − b'| ≤ r}, and DIS(B(h_[a,b], r)) = [a − r, a + r) ∪ (b − r, b + r]. 
Therefore, ∂h_[a,b] = lim_{r→0} DIS(B(h_[a,b], r)) = lim_{r→0} [a − r, a + r) ∪ (b − r, b + r] = {a, b}. Thus, 
in this case as well, the disagreement core of h_[a,b] with respect to C and 𝒫 is again the decision 
boundary of the classifier. 

As the above two examples illustrate, ∂f often corresponds to the decision boundary of f in 
some geometric interpretation of 𝒳 and f. Indeed, under fairly general conditions on C and 𝒫, 
the disagreement core of f does correspond to (a subset of) the set of points dividing the two label 
regions of f; for instance, Friedman (2009) derives sufficient conditions under which this is the 
case. In these cases, the behavior of disagreement-based active learning algorithms can often be 
interpreted in the intuitive terms of seeking label requests near the decision boundary of the target 
function, to refine an estimate of that boundary. However, in some more subtle scenarios this is no 
longer the case, for interesting reasons. To illustrate this, let us continue the example of interval 
classifiers from above, but now consider h_[a,a] (i.e., h_[a,b] with a = b). This time, for any r ∈ (0, 1) 
we have B(h_[a,a], r) = {h_[a',b'] ∈ C : b' − a' ≤ r}, and DIS(B(h_[a,a], r)) = (0, 1). Therefore, 
∂h_[a,a] = lim_{r→0} DIS(B(h_[a,a], r)) = lim_{r→0} (0, 1) = (0, 1). 

This example shows that in some cases, the disagreement core does not correspond to the 
decision boundary of the classifier, and indeed has 𝒫(∂f) > 0. Intuitively, as in the above example, 
this typically happens when the decision surface of the classifier is in some sense simpler than it 
could be. For instance, consider the space C of unions of two intervals (Example 3 with i = 2) 
under uniform 𝒫. The classifiers f ∈ C with 𝒫(∂f) > 0 are precisely those representable (up to 
probability zero differences) as a single interval. The others (with 0 < z_1 < z_2 < z_3 < z_4 < 1) 
have ∂h_z = {z_1, z_2, z_3, z_4}. In these examples, the f ∈ C with 𝒫(∂f) > 0 are not only simpler 
than other nearby classifiers in C, but they are also in some sense degenerate relative to the rest of 
C; however, it turns out this is not always the case, as there exist scenarios (C, 𝒫), even with d = 2, 
and even with countable C, for which every f ∈ C has 𝒫(∂f) > 0; in these cases, every classifier 
is in some important sense simpler than some other subset of nearby classifiers in C. 

In Section 3.3, we show that the label complexity of disagreement-based active learning is 
intimately tied to the disagreement core. In particular, scenarios where 𝒫(∂f) > 0, such as those 
mentioned above, lead to the conclusion that disagreement-based methods are sometimes 
insufficient for activized learning. This motivates the design of more sophisticated methods in Section 4, 
which overcome this deficiency, along with a corresponding refinement of the definition of 
"disagreement core" in Section 5.2 that eliminates the above issue with "simple" classifiers. 

3.3 Necessary and Sufficient Conditions for Disagreement-Based Activized Learning 

In the specific case of Meta-Algorithm 0, for large n we may intuitively expect it to focus its second 
batch of label requests in and around the disagreement core of the target function. Thus, whenever 
𝒫(∂f) = 0, we should expect the label requests to be quite focused, and therefore the algorithm 
should achieve higher accuracy compared to passive learning. On the other hand, if 𝒫(∂f) > 0, 
then the label requests will not become focused beyond a constant fraction of the space, so that the 
improvements achieved by Meta-Algorithm 0 over passive learning should be, at best, a constant 
factor. This intuition is formalized in the following general theorem, the proof of which is included 
in Appendix A. 

Theorem 5 For any VC class C, Meta-Algorithm 0 is a universal activizer for C if and only if every 
f ∈ C and distribution 𝒫 has 𝒫(∂_{C,𝒫} f) = 0. 

While the formal proof is given in Appendix A, the general idea is simple. As we always have 
f ∈ V, any ŷ inferred in Step 6 must equal f(x), so that all of the labels in ℒ are correct. Also, as n 
grows large, classic results on passive learning imply the diameter of the set V will become small, 
shrinking to zero as n → ∞ (Vapnik, 1982; Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989). 
Therefore, as n → ∞, DIS(V) should converge to a subset of ∂f, so that in the case 𝒫(∂f) = 0, 
we have Δ̂ → 0; thus |ℒ| ≫ n, which implies an asymptotic strict improvement in label complexity 
over the passive algorithm A_p that ℒ is fed into in Step 8. On the other hand, since ∂f is defined by 
classifiers arbitrarily close to f, it is unlikely that any finite sample of correctly labeled examples can 
contradict enough classifiers to make DIS(V) significantly smaller than ∂f, so that we always have 
𝒫(DIS(V)) ≥ 𝒫(∂f). Therefore, if 𝒫(∂f) > 0, then Δ̂ converges to some nonzero constant, so 
that |ℒ| = O(n), representing only a constant factor improvement in label complexity. In fact, as is 
implied from this sketch (and is proven in Appendix A), the targets f and distributions 𝒫 for which 
Meta-Algorithm 0 achieves asymptotic strict improvements for all passive learning algorithms (for 
which f and 𝒫 are nontrivial) are precisely those (and only those) for which 𝒫(∂_{C,𝒫} f) = 0. 

There are some general conditions under which the zero-probability disagreement cores 
condition of Theorem 5 will hold. For instance, it is not difficult to show this will always hold when 𝒳 
is countable; furthermore, with some effort one can show it will hold for most classes having VC 
dimension one (e.g., any countable C with d = 1). However, as we have seen, not all spaces C 
satisfy this zero-probability disagreement cores property. In particular, for the interval classifiers 
studied in Section 3.2, we have 𝒫(∂h_[a,a]) = 𝒫((0, 1)) = 1. Indeed, the aforementioned special 
cases aside, for most nontrivial spaces C, one can construct distributions 𝒫 that in some sense mimic 
the intervals problem, so that we should typically expect disagreement-based methods will not be 
activizers. For detailed discussions of various scenarios where the 𝒫(∂_{C,𝒫} f) = 0 condition is (or 
is not) satisfied for various C, 𝒫, and f, see the works of Hanneke (2009b, 2007b, 2011); Balcan, 
Hanneke, and Vaughan (2010); Friedman (2009); Wang (2009, 2011). 



4. Beyond Disagreement: A Basic Activizer 

Since the zero-probability disagreement cores condition of Theorem 5 is not always satisfied, we are 
left with the question of whether there could be other techniques for active learning, beyond simple 
disagreement-based methods, which could activize every passive learning algorithm for every VC 
class. In this section, we present an entirely new type of active learning algorithm, unlike anything 
in the existing literature, and we show that indeed it is a universal activizer for any class C of finite 
VC dimension. 



4.1 A Basic Activizer 

As mentioned, the case 𝒫(∂f) = 0 is already handled nicely by disagreement-based methods, since 
the label requests made in the second stage of Meta-Algorithm 0 will become focused into a small 
region, and ℒ therefore grows faster than n. Thus, the primary question we are faced with is what 
to do when 𝒫(∂f) > 0. Since (loosely speaking) we have DIS(V) → ∂f in Meta-Algorithm 0, 
𝒫(∂f) > 0 corresponds to scenarios where the label requests of Meta-Algorithm 0 will not become 
focused beyond a certain extent: specifically, since 𝒫(DIS(V) ⊕ ∂f) → 0 almost surely (where 
⊕ is the symmetric difference), Meta-Algorithm 0 will request labels for a constant fraction of the 
examples in ℒ. 

On the one hand, this is definitely a major problem for disagreement-based methods, since it 
prevents them from improving over passive learning in those cases. On the other hand, if we do 
not restrict ourselves to disagreement-based methods, we may actually be able to exploit properties 
of this scenario, so that it works to our advantage. In particular, since 𝒫(DIS(V) ⊕ ∂_C f) → 0 
and 𝒫(∂_V f ⊕ ∂_C f) = 0 (almost surely) in Meta-Algorithm 0, for sufficiently large n a 
random point x_1 in DIS(V) is likely to be in ∂_V f. We can exploit this fact by using x_1 to split 
V into two subsets: V[(x_1, +1)] and V[(x_1, −1)]. Now, if x_1 ∈ ∂_V f, then (by definition of 
the disagreement core) inf_{h ∈ V[(x_1,+1)]} er(h) = inf_{h ∈ V[(x_1,−1)]} er(h) = 0. Therefore, for almost every 
point x ∉ DIS(V[(x_1, +1)]), the label agreed upon for x by classifiers in V[(x_1, +1)] should be 
f(x). Similarly, for almost every point x ∉ DIS(V[(x_1, −1)]), the label agreed upon for x by 
classifiers in V[(x_1, −1)] should be f(x). Thus, we can accurately infer the label of any point 
x ∉ DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) (except perhaps a probability zero subset). With these 
sets V[(x_1, +1)] and V[(x_1, −1)] in hand, there is no longer a need to request the labels of points 
for which either of them has agreement about the label, and we can focus our label requests to the 
region DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]), which may be much smaller than DIS(V). Now if 
𝒫(DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)])) → 0, then the label requests will become focused to a 
shrinking region, and by the same reasoning as for Theorem 5 we can asymptotically achieve strict 
improvements over passive learning by a method analogous to Meta-Algorithm 0 (with changes as 
described above). 



Already this provides a significant improvement over disagreement-based methods in many 
cases; indeed, in some cases (such as intervals) this already addresses the nonzero-probability 
disagreement core issue in Theorem 5. In other cases (such as unions of two intervals), it does 
not completely address the issue, since for some targets we do not have 𝒫(DIS(V[(x_1, +1)]) ∩ 
DIS(V[(x_1, −1)])) → 0. However, by repeatedly applying this same reasoning, we can 
address the issue in full generality. Specifically, if 𝒫(DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)])) ↛ 0, 
then DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) essentially converges to a region ∂_{C[(x_1,+1)]} f ∩ 
∂_{C[(x_1,−1)]} f, which has nonzero probability, and is nearly equivalent to ∂_{V[(x_1,+1)]} f ∩ ∂_{V[(x_1,−1)]} f. 
Thus, for sufficiently large n, a random x_2 in DIS(V[(x_1, +1)]) ∩ DIS(V[(x_1, −1)]) will likely 
be in ∂_{V[(x_1,+1)]} f ∩ ∂_{V[(x_1,−1)]} f. In this case, we can repeat the above argument, this time 
splitting V into four sets (V[(x_1, +1)][(x_2, +1)], V[(x_1, +1)][(x_2, −1)], V[(x_1, −1)][(x_2, +1)], and 
V[(x_1, −1)][(x_2, −1)]), each with infimum error rate equal to zero, so that for any point x in the 
region of agreement of any of these four sets, the agreed-upon label will (almost surely) be f(x), so 
that we can infer that label. Thus, we need only request the labels of those points in the intersection 
of all four regions of disagreement. We can further repeat this process as many times as needed, 
until we get a partition of V with shrinking probability mass in the intersection of the regions of 
disagreement, which (as above) can then be used to obtain asymptotic improvements over passive 
learning. 



Note that the above argument can be written more concisely in terms of shattering. That is, 
any x ∈ DIS(V) is simply an x such that V shatters {x}; a point x ∈ DIS(V[(x_1, +1)]) ∩ 
DIS(V[(x_1, −1)]) is simply one for which V shatters {x_1, x}, and for any x ∉ DIS(V[(x_1, +1)]) ∩ 
DIS(V[(x_1, −1)]), the label y we infer about x has the property that the set V[(x, −y)] does not 
shatter {x_1}. This continues for each repetition of the above idea, with x in the intersection of 
the four regions of disagreement simply being one for which V shatters {x_1, x_2, x}, and so on. In 
particular, this perspective makes it clear that we need only repeat this idea at most d times to get 
a shrinking intersection region, since no set of d + 1 points is shatterable. Note that there may 
be unobservable factors (e.g., the target function) determining the appropriate number of iterations 
of this idea sufficient to have a shrinking probability of requesting a label, while maintaining the 
accuracy of inferred labels. To address this, we can simply try all d + 1 possibilities, and then select 
one of the resulting d + 1 classifiers via a kind of tournament of pairwise comparisons. Also, in 
order to reduce the probability of a mistaken inference due to x_1 ∉ ∂_V f (or similarly for later x_i), 
we can replace each single x_i with multiple samples, and then take a majority vote over whether to 
infer the label, and which label to infer if we do so; generally, we can think of this as estimating 
certain probabilities, and below we write these estimators as P̂_m, and discuss the details of their 
implementation later. Combining Meta-Algorithm 0 with the above reasoning motivates a new type 
of active learning algorithm, referred to as Meta-Algorithm 1 below, and stated as follows. 
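
The shattering view of the argument can be made concrete for a finite version space. The sketch below is illustrative only: `shatters`, `restrict`, and `interval` are hypothetical helper names, and the version space is a small grid of intervals on [0, 1] consistent with a single observed negative example at 0.5 (an assumption chosen to keep the enumeration tiny).

```python
from typing import Callable, List, Sequence

Classifier = Callable[[float], int]


def shatters(V: Sequence[Classifier], S: Sequence[float]) -> bool:
    """True iff every labeling of S in {-1, +1}^|S| is realized by some h in V."""
    realized = {tuple(h(x) for x in S) for h in V}
    return len(realized) == 2 ** len(S)


def restrict(V: Sequence[Classifier], x: float, y: int) -> List[Classifier]:
    """V[(x, y)]: the classifiers in V that label x as y."""
    return [h for h in V if h(x) == y]


def interval(a: float, b: float) -> Classifier:
    """Interval classifier: +1 on [a, b], -1 elsewhere."""
    return lambda x: 1 if a <= x <= b else -1


grid = [i / 10 for i in range(11)]
# Version space: grid intervals that do not contain the observed negative point 0.5.
V = [interval(a, b) for a in grid for b in grid if a <= b and not (a <= 0.5 <= b)]

print(shatters(V, [0.2]))                      # True: 0.2 is in DIS(V)
print(shatters(V, [0.2, 0.3]))                 # True: no negative point between them
print(shatters(V, [0.2, 0.8]))                 # False: 0.5 separates the two points
print(shatters(restrict(V, 0.8, +1), [0.2]))   # False: so the label -1 is inferred for 0.8
```

Note that `shatters(V, [])` is true exactly when V is nonempty, which matches the convention that V shatters the empty set iff V ≠ {}.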

Meta-Algorithm 1 

Input: passive algorithm A_p, label budget n 
Output: classifier ĥ 

0. Request the first m_n = ⌊n/3⌋ labels, {Y_1, ..., Y_{m_n}}, and let t ← m_n 

1. Let V = {h ∈ C : er_{m_n}(h) = 0} 

2. For k = 1, 2, ..., d + 1 

3. Δ̂^(k) ← P̂_{m_n}(x : P̂(S ∈ 𝒳^{k−1} : V shatters S ∪ {x} | V shatters S) ≥ 1/2) 

4. Let ℒ_k ← {} 

5. For m = m_n + 1, ..., m_n + ⌊n/(6 · 2^k Δ̂^(k))⌋ 

6. If P̂_m(S ∈ 𝒳^{k−1} : V shatters S ∪ {X_m} | V shatters S) ≥ 1/2 and t < ⌊2n/3⌋ 

7. Request the label Y_m of X_m, and let ŷ ← Y_m and t ← t + 1 

8. Else, let ŷ ← argmax_{y ∈ {−1,+1}} P̂_m(S ∈ 𝒳^{k−1} : V[(X_m, −y)] does not shatter S | V shatters S) 

9. Let ℒ_k ← ℒ_k ∪ {(X_m, ŷ)} 

10. Return ActiveSelect({A_p(ℒ_1), A_p(ℒ_2), ..., A_p(ℒ_{d+1})}, ⌊n/3⌋, {X_{m_n + max_k |ℒ_k| + 1}, ...}) 



Subroutine: ActiveSelect 

Input: set of classifiers {h_1, h_2, ..., h_N}, label budget m, sequence of unlabeled examples U 
Output: classifier ĥ 

0. For each j, k ∈ {1, 2, ..., N} s.t. j < k, 

1. Let R_jk be the first ⌊m / (j(N − j) ln(eN))⌋ points in U ∩ {x : h_j(x) ≠ h_k(x)} (if such values exist) 

2. Request the labels for R_jk and let Q_jk be the resulting set of labeled examples 

3. Let m_kj = er_{Q_jk}(h_k) 

4. Return h_k̂, where k̂ = max{k ∈ {1, ..., N} : max_{j<k} m_kj ≤ 7/12} 
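
The pairwise tournament in ActiveSelect can be sketched as follows. This is a minimal illustration under assumptions that are not part of the source: the unlabeled sequence U is passed as a finite list, labels come from a hypothetical `oracle` callable, and the empirical error on an empty set Q_jk is taken to be 0. The per-pair budget and the 7/12 threshold follow the subroutine as stated above.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Classifier = Callable[[float], int]


def active_select(hs: Sequence[Classifier], m: int,
                  unlabeled: Sequence[float],
                  oracle: Callable[[float], int]) -> Classifier:
    N = len(hs)
    err: Dict[Tuple[int, int], float] = {}  # err[(k, j)] plays the role of m_kj
    for j in range(1, N + 1):
        for k in range(j + 1, N + 1):
            hj, hk = hs[j - 1], hs[k - 1]
            # Step 1: first floor(m / (j (N - j) ln(eN))) points where h_j and h_k disagree.
            budget = int(m / (j * (N - j) * math.log(math.e * N)))
            R = [x for x in unlabeled if hj(x) != hk(x)][:budget]
            # Step 2: request labels for those points.
            Q = [(x, oracle(x)) for x in R]
            # Step 3: empirical error of h_k on Q (0 if Q is empty: an assumption).
            err[(k, j)] = sum(hk(x) != y for x, y in Q) / len(Q) if Q else 0.0
    # Step 4: largest index k whose error against every earlier h_j is at most 7/12.
    k_hat = max(k for k in range(1, N + 1)
                if all(err[(k, j)] <= 7 / 12 for j in range(1, k)))
    return hs[k_hat - 1]


def threshold(z: float) -> Classifier:
    return lambda x: 1 if x >= z else -1


hs = [threshold(0.9), threshold(0.5), threshold(0.1)]
pool = [i / 100 for i in range(100)]
chosen = active_select(hs, 30, pool, threshold(0.5))
print(chosen is hs[1])  # True: the classifier matching the target is selected
```

Since the index k = 1 has no earlier competitors, it always satisfies the condition in Step 4, so the maximum is taken over a nonempty set.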



Meta-Algorithm 1 is stated as a function of three types of estimated probabilities: namely, 

P̂_m(S ∈ 𝒳^{k−1} : V shatters S ∪ {x} | V shatters S), 

P̂_m(S ∈ 𝒳^{k−1} : V[(x, −y)] does not shatter S | V shatters S), 

and P̂_m(x : P̂(S ∈ 𝒳^{k−1} : V shatters S ∪ {x} | V shatters S) ≥ 1/2). 

These can be defined in a variety of ways to make this a universal activizer. Generally, the only 
requirement seems to be that they converge to the appropriate respective probabilities in the limit. 
For the theorem stated below regarding Meta-Algorithm 1, we will take the specific definitions 
stated in Appendix B.1. 

Meta-Algorithm 1 requests labels in three batches: one to initially prune down the version 
space V, a second one to construct the labeled samples ℒ_k, and a third batch to select among the 
d + 1 classifiers A_p(ℒ_k) in the ActiveSelect subroutine. As before, the choice of the number of 
(unlabeled) examples to process in the second batch guarantees (by a Chernoff bound) that the 
"t < ⌊2n/3⌋" constraint in Step 6 is redundant. The mechanism for requesting labels in the second 
batch is motivated by the reasoning outlined above, using the shatterable sets S to split V into 
2^{k−1} subsets, each of which approximates the target with high probability (for large n), and then 
checking whether the new point x is in the regions of disagreement for all 2^{k−1} subsets (by testing 
shatterability of S ∪ {x}). To increase confidence in this test, we use many such S sets, and let 
them vote on whether or not to request the label (Step 6). As mentioned, if x is not in the region 
of disagreement for one of these 2^{k−1} subsets (call it V′), the agreed-upon label y has the property 
that V[(x, −y)] does not shatter S (since V[(x, −y)] does not intersect with V′, which represents 
one of the 2^{k−1} labelings required to shatter S). Therefore, we infer that this label y is the correct 
label of x, and again we vote over many such S sets to increase confidence in this choice (Step 8). 
As mentioned, this reasoning leads to correctly inferred labels in Step 8 as long as n is sufficiently 
large and 𝒫^{k−1}(S ∈ 𝒳^{k−1} : V shatters S) ↛ 0. In particular, we are primarily interested in the 
largest value of k for which this reasoning holds, since this is the value at which the probability of 
requesting a label (Step 7) shrinks to zero as n → ∞. However, since we typically cannot predict 
a priori what this largest valid k value will be (as it is target-dependent), we try all d + 1 values of 
k, to generate d + 1 hypotheses, and then use a simple pairwise testing procedure to select among 
them; note that we need try at most d + 1 values, since V definitely cannot shatter any S ∈ 𝒳^{d+1}. 
We will see that the ActiveSelect subroutine is guaranteed to select a classifier with error rate never 
significantly larger than the best among the classifiers given to it (say within a factor of 2, with high 
probability). Therefore, in the present context, we need only consider whether some k has a set 
ℒ_k with correct labels and |ℒ_k| ≫ n. 

4.2 Examples 

In the next subsection, we state a general result for Meta-Algorithm 1. But first, to illustrate how 
this procedure operates, we walk through its behavior on our usual examples; as we did for the 
examples of Meta-Algorithm 0, to simplify the explanation, for now we will ignore the fact that 
the P̂_m values are estimates, as well as the "t < ⌊2n/3⌋" constraint of Step 6, and the issue of 
effectiveness of ActiveSelect; in the proofs of the general results below, we will show that these 
issues do not fundamentally change the analysis. For now, we merely focus on showing that some 
k has ℒ_k correctly labeled and |ℒ_k| ≫ n. 

For threshold classifiers (Example 1), we have d = 1. In this case, the k = 1 round of the 
algorithm is essentially identical to Meta-Algorithm 0 (recall our conventions that 𝒳^0 = {∅}, 
𝒫^0(𝒳^0) = 1, and V shatters ∅ iff V ≠ {}), and we therefore have |ℒ_1| ≫ n, as discussed previously, 
so that Meta-Algorithm 1 is a universal activizer for threshold classifiers. 

Next consider interval classifiers (Example 2), with 𝒫 uniform on [0, 1]; in this case, we have 
d = 2. If f = h_[a,b] for a < b, then again the k = 1 round behaves essentially the same as 
Meta-Algorithm 0, and since we have seen 𝒫(∂h_[a,b]) = 0 in this case, we have |ℒ_1| ≫ n. However, the 
behavior becomes far more interesting when f = h_[a,a], which was precisely the case that prevented 
Meta-Algorithm 0 from improving over passive learning. In this case, as we know from above, the 
k = 1 round will have |ℒ_1| = O(n), so that we need to consider larger values of k to identify 
improvements. In this case, the k = 2 round behaves as follows. With probability 1, the initial 
⌊n/3⌋ labels used to define V will all be negative. Thus, V is precisely the set of intervals that do 
not contain any of the initial ⌊n/3⌋ points. Now consider any S = {x_1} ∈ 𝒳^1, with x_1 not equal 
to any of these initial ⌊n/3⌋ points, and consider any x ∉ {x_1, X_1, ..., X_⌊n/3⌋}. First note that V 
shatters S, since we can optionally put a small interval around x_1 using an element of V. If there 
is a point x' among the initial ⌊n/3⌋ between x and x_1, then any h_[a,b] ∈ V with x ∈ [a, b] cannot 
also have x_1 ∈ [a, b], as it would also contain the observed negative point between them. Thus, V 
does not shatter {x_1, x} = S ∪ {x}, so that this S will vote to infer (rather than request) the label of 
x in Step 6. Furthermore, we see that V[(x, +1)] does not shatter S, while V[(x, −1)] does shatter 
S, so that this S would also vote for the label y = −1 in Step 8. For sufficiently large n, with high 
probability, any given x not equal to one of the initial ⌊n/3⌋ should have most (probability at least 
1 − O(n^{−1} log n)) of the possible x_1 values separated from it by at least one of the initial ⌊n/3⌋ 
points, so that the outcome of the vote in Step 6 will be a decision to infer (not request) the label, 
and the vote in Step 8 will be for −1. Since, with probability one, every X_m ≠ a, we have every 
Y_m = −1, so that every point in ℒ_2 is labeled correctly. This also indicates that, for sufficiently 
large n, we have 𝒫(x : 𝒫^1(S ∈ 𝒳^1 : V shatters S ∪ {x} | V shatters S) ≥ 1/2) = 0, so that the size 
of ℒ_2 is only limited by the precision of estimation in P̂_{m_n} in Step 3. Thus, as long as we implement 
P̂_{m_n} so that its value is at most o(1) larger than the true probability, we can guarantee |ℒ_2| ≫ n. 
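
The voting in the k = 2 round for the zero-width interval target can be simulated directly, using the observation above that V shatters {x_1, x} exactly when no observed negative point lies between x_1 and x. The sketch below is illustrative: the function names, the uniform sampling of x_1, and the number of sampled sets are assumptions made for this example, not the estimators defined in the appendix.

```python
import bisect
import random
from typing import List


def separated(x1: float, x: float, negatives: List[float]) -> bool:
    """True iff some observed negative point lies strictly between x1 and x.

    `negatives` must be sorted in increasing order.
    """
    lo, hi = min(x1, x), max(x1, x)
    i = bisect.bisect_right(negatives, lo)  # index of the first sample point > lo
    return i < len(negatives) and negatives[i] < hi


def step6_fraction(x: float, negatives: List[float],
                   rng: random.Random, num_sets: int = 1000) -> float:
    """Monte Carlo estimate of P(V shatters {x1, x} | V shatters {x1})."""
    hits = sum(not separated(rng.random(), x, negatives) for _ in range(num_sets))
    return hits / num_sets


def step8_label(x: float, negatives: List[float],
                rng: random.Random, num_sets: int = 1000) -> int:
    """Vote on the inferred label of x.

    For a separated x1, V[(x, +1)] does not shatter {x1}, which is a vote for
    y = -1; V[(x, -1)] always shatters {x1} here, so y = +1 receives no votes.
    """
    votes_minus = sum(separated(rng.random(), x, negatives) for _ in range(num_sets))
    votes_plus = 0
    return -1 if votes_minus >= votes_plus else 1


negatives = [i / 100 for i in range(1, 100)]  # stand-in for the initial labeled points
rng = random.Random(0)
print(step6_fraction(0.375, negatives, rng))  # small: the label is inferred, not requested
print(step8_label(0.375, negatives, rng))     # -1
```

Because only the x_1 values falling in the same gap as x fail to be separated from it, the estimated fraction in Step 6 stays well below 1/2, so the label is inferred rather than requested.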

The unions of i intervals example (Example 3), again under 𝒫 uniform on [0, 1], is slightly 
more involved; in this case, the appropriate value of k to consider for any given target depends 
on the minimum number of intervals necessary to represent the target function (up to 
probability-zero differences). If j intervals are required for this, then the appropriate value is k = i − j + 1. 
Specifically, suppose the target is minimally representable as a union of j ∈ {1, ..., i} intervals 
of nonzero width: [z_1, z_2] ∪ [z_3, z_4] ∪ ··· ∪ [z_{2j−1}, z_{2j}]; that is, z_1 < z_2 < ... < z_{2j−1} < z_{2j}. 
Every target in C has distance zero to some classifier of this type, and will agree with that classifier 
on all samples with probability one, so we lose no generality by assuming all j intervals have 
nonzero width. Then consider any x ∈ (0, 1) separated from each of the z_p values by at least one 
of the initial ⌊n/3⌋ points, and not itself equal to one of those initial points. Further consider any 
S = {x_1, ..., x_{i−j}} ∈ 𝒳^{i−j} such that, between any pair of elements of S ∪ {x} ∪ {z_1, ..., z_{2j}}, 
there is at least one of the initial ⌊n/3⌋ points. First note that V shatters S, since for any x_ℓ not 
in one of the [z_{2p−1}, z_{2p}] intervals (i.e., negative), we may optionally add an interval [x_ℓ, x_ℓ] while 
staying in V, and for any x_ℓ in one of the [z_{2p−1}, z_{2p}] intervals (i.e., positive), we may optionally 
split [z_{2p−1}, z_{2p}] into two intervals to barely exclude the point x_ℓ (and a small neighborhood around 
it), by adding at most one interval to the representation; thus, in total we need to add at most i − j 
intervals to the representation, so that the largest number of intervals used by any of these 2^{i−j} 
classifiers involved in shattering is i, as required; furthermore, note that one of these 2^{i−j} classifiers 
actually requires i intervals. Now for any such x and S = {x_1, ..., x_{i−j}} as above, since one of 
the 2^{i−j} classifiers in V used to shatter S requires i intervals to represent it, and x is separated from 
each element of S ∪ {z_1, ..., z_{2j}} by a labeled example, we see that V cannot shatter S ∪ {x}. 
Furthermore, if f(x) = y, then the labeled examples to the immediate left and right of x are also 
labeled y, and in particular among the 2^{i−j} classifiers h from V that shatter S, the one h that requires 
i intervals to represent must also have h(x) = y, so that V[(x, −y)] does not shatter S. Thus, any 
set S satisfying this separation property will vote to infer (rather than request) the label of x in Step 
6, and will vote for the label f(x) in Step 8. Furthermore, for sufficiently large n, for any given 
x with the described property, with high probability most of the sets S ∈ 𝒳^{i−j} will satisfy this 
pairwise separation property, and therefore so will most of the shatterable sets S ∈ 𝒳^{i−j}, so that the 
overall outcome of the votes will favor inferring the label of x, and in particular inferring the label 
f(x) for x. On the other hand, for x not satisfying this property (i.e., not separated from some z_p 
by any of the initial ⌊n/3⌋ examples), for any set S as above, V can shatter S ∪ {x}, since we can 
optionally increase or decrease z_p to include or exclude x from the associated interval, in addition 
to optionally adding the extra intervals to shatter S; therefore, by the same reasoning as above, for 
sufficiently large n, any such x will satisfy the condition in Step 6, and thus have its label requested. 
Thus, for sufficiently large n, every example in ℒ_{i−j+1} will be labeled correctly. Finally, note that 
with probability 1, the set of points x separated from each of the z_p values by at least one of the 
⌊n/3⌋ initial points has probability approaching 1 as n → ∞, so that again we have |ℒ_{i−j+1}| ≫ n. 

The above examples give some intuition about the operation of this procedure. Next, we turn to 
general results showing that this type of improvement generally holds. 

4.3 General Results on Activized Learning 

Returning to the abstract setting, we have the following general theorem, representing one of the
main results of this paper. Its proof is included in Appendix B.

Theorem 6 For any VC class $\mathbb{C}$, Meta-Algorithm 1 is a universal activizer for $\mathbb{C}$. ⋄



This result is interesting both for its strength and generality. Recall that it means that given any
passive learning algorithm $\mathcal{A}_p$, the active learning algorithm obtained by providing $\mathcal{A}_p$ as input to
Meta-Algorithm 1 achieves a label complexity that strongly dominates that of $\mathcal{A}_p$ for all nontrivial
distributions $\mathcal{P}$ and target functions $f \in \mathbb{C}$. Results of this type were not previously known. The
specific technical advance over existing results (namely, those of Balcan, Hanneke, and Vaughan
(2010)) is the fact that Meta-Algorithm 1 has no direct dependence on the distribution $\mathcal{P}$; as
mentioned earlier, the (very different) approach proposed by Balcan, Hanneke, and Vaughan (2010)
has a strong direct dependence on the distribution, to the extent that the distribution-dependence
in that approach cannot be removed by merely replacing certain calculations with data-dependent
estimators (as we did in Meta-Algorithm 1). In the proof, we actually show a somewhat more general
result: namely, that Meta-Algorithm 1 achieves these asymptotic improvements for any target
function $f$ in the closure of $\mathbb{C}$ (i.e., any $f$ such that $\forall r > 0$, $\mathrm{B}(f, r) \neq \emptyset$).

The following corollary is one concrete implication of Theorem 6.

Corollary 7 For any VC class $\mathbb{C}$, there exists an active learning algorithm achieving a label
complexity $\Lambda_a$ such that, for all target functions $f \in \mathbb{C}$ and distributions $\mathcal{P}$,

$$\Lambda_a(\varepsilon, f, \mathcal{P}) = o(1/\varepsilon).$$ ⋄

Proof The one-inclusion graph passive learning algorithm of Haussler, Littlestone, and Warmuth
(1994) is known to achieve label complexity at most $d/\varepsilon$, for every target function $f \in \mathbb{C}$ and
distribution $\mathcal{P}$. Thus, Theorem 6 implies that the (Meta-Algorithm 1)-activized one-inclusion graph
algorithm satisfies the claim. ■



As a byproduct, Theorem 6 also establishes the basic fact that there exist activizers. In some
sense, this observation opens up a new realm for exploration: namely, characterizing the properties
that activizers can possess. This topic includes a vast array of questions, many of which deal with
whether activizers are capable of preserving various properties of the given passive algorithm (e.g.,
margin-based dimension-independence, minimaxity, admissibility, etc.). Section 7 describes a variety
of enticing questions of this type. In the sections below, we will consider quantifying how large
the gap in label complexity between the given passive learning algorithm and the resulting activized
algorithm can be. We will additionally study the effects of label noise on the possibility of activized
learning.

4.4 Implementation and Efficiency

Meta-Algorithm 1 typically also has certain desirable efficiency guarantees. Specifically, suppose
that for any $m$ labeled examples $Q$, there is an algorithm with $\mathrm{poly}(d \cdot m)$ running time that finds
some $h \in \mathbb{C}$ with $\mathrm{er}_Q(h) = 0$ if one exists, and otherwise returns a value indicating that no
such $h$ exists in $\mathbb{C}$; for many concept spaces with a kind of geometric interpretation, there are
known methods with this capability (Khachiyan, 1979; Karmarkar, 1984; Valiant, 1984; Kearns and
Vazirani, 1994). We can use such a subroutine to create an efficient implementation of the main
body of Meta-Algorithm 1. Specifically, rather than explicitly representing $V$ in Step 1, we can
simply store the set $Q_0 = \{(X_1, Y_1), \ldots, (X_{m_n}, Y_{m_n})\}$. Then for any step in the algorithm where
we need to test whether $V$ shatters a set $R$, we can simply try all $2^{|R|}$ possible labelings of $R$,
and for each one temporarily add these $|R|$ additional labeled examples to $Q_0$ and check whether
there is an $h \in \mathbb{C}$ consistent with all of the labels. At first, it might seem that these $2^k$ evaluations
would be prohibitive; however, supposing $\hat{P}_m$ is implemented so that it is $\Omega(1/\mathrm{poly}(n))$ (as it is
in Appendix B.1), note that the loop beginning at Step 5 executes a nonzero number of times only if
$n/\hat{\Delta}^{(k)} > 2^k$, so that $2^k \leq \mathrm{poly}(n)$; we can easily add a condition that skips the step of calculating
$\hat{\Delta}^{(k)}$ if $2^k$ exceeds this $\mathrm{poly}(n)$ bound on $n/\hat{\Delta}^{(k)}$, so that even those shatterability tests can
be skipped in this case. Thus, for the actual occurrences of it in the algorithm, testing whether $V$
shatters $R$ requires only $\mathrm{poly}(n) \cdot \mathrm{poly}(d \cdot (|Q_0| + |R|))$ time. The total number of times this test
is performed in calculating $\hat{\Delta}^{(k)}$ (from Appendix B.1) is itself only $\mathrm{poly}(n)$, and the number of
iterations of the loop in Step 5 is at most $n/\hat{\Delta}^{(k)} = \mathrm{poly}(n)$. Determining the label $\hat{y}$ in Step 8
can be performed in a similar fashion. So in general, the total running time of the main body of
Meta-Algorithm 1 is $\mathrm{poly}(d \cdot n)$.
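The shatterability test described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `shatters`, `find_consistent`, and `threshold_oracle` are hypothetical and do not come from the paper, and the oracle shown handles only threshold classifiers on the real line, standing in for the assumed polynomial-time consistency subroutine.

```python
from itertools import product


def shatters(q0, r, find_consistent):
    """Test whether the version space induced by q0 shatters the points in r.

    q0: list of (x, y) labeled examples stored in place of V (assumption).
    r: list of unlabeled points.
    find_consistent: hypothetical oracle returning a consistent classifier
        description, or None if no classifier in the class fits the examples.
    """
    for labels in product((-1, +1), repeat=len(r)):
        # Temporarily append the |r| extra labeled examples to q0.
        trial = list(q0) + list(zip(r, labels))
        if find_consistent(trial) is None:
            return False  # some labeling of r is not realizable within V
    return True


def threshold_oracle(examples):
    """Illustrative oracle for thresholds h_z(x) = +1 iff x >= z."""
    lo = max((x for x, y in examples if y == -1), default=float("-inf"))
    hi = min((x for x, y in examples if y == +1), default=float("inf"))
    # Any z with lo < z <= hi is consistent; return one such z if it exists.
    return hi if lo < hi else None
```

The loop makes $2^{|R|}$ oracle calls, which is why the text restricts the test to cases where $2^k$ is polynomially bounded in $n$.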

The only remaining question is the efficiency of the final step. Of course, we can require $\mathcal{A}_p$
to have running time polynomial in the size of its input set (and $d$). But beyond this, we must consider
the efficiency of the ActiveSelect subroutine. This actually turns out to have some subtleties
involved. The way it is stated above is simple and elegant, but not always efficient. Specifically,
we have no a priori bound on the number of unlabeled examples the algorithm must process before
finding a point $X_m$ where $h_j(X_m) \neq h_k(X_m)$. Indeed, if $\mathcal{P}(x : h_j(x) \neq h_k(x)) = 0$, we may
effectively need to examine the entire infinite sequence of $X_m$ values to determine this. Fortunately,
these problems can be corrected without difficulty, simply by truncating the search at a predetermined
number of points. Specifically, rather than taking the next $\lfloor m/\binom{N}{2} \rfloor$ examples for which $h_j$
and $h_k$ disagree, simply restrict ourselves to at most this number, or at most the number of such
points among the next $M$ unlabeled examples. In Appendix B, we show that ActiveSelect, as originally
stated, has a high-probability guarantee (with exponentially small failure probability) that the classifier
it selects has error rate at most twice the best of the $N$ it is given. With the modification to truncate
the search at $M$ unlabeled examples, this guarantee is increased to $\min_k \mathrm{er}(h_k) + \max\{\mathrm{er}(h_k), m/M\}$. For the
concrete guarantee of Corollary 7, it suffices to take $M \gg m^2$. However, to guarantee the modified
ActiveSelect can still be used in Meta-Algorithm 1 while maintaining (the stronger) Theorem 6,
we need $M$ at least as big as $\Omega(\min\{\exp\{m^c\}, m/\min_k \mathrm{er}(h_k)\})$, for any constant $c > 0$. In
general, if we have a $1/\mathrm{poly}(n)$ lower bound on the error rate of the classifier produced by $\mathcal{A}_p$ for a
given number of labeled examples as input, we can set $M$ as above using this lower bound in place
of $\min_k \mathrm{er}(h_k)$, resulting in an efficient version of ActiveSelect that still guarantees Theorem 6.
However, it is presently not known whether there always exist universal activizers that are efficient
(either $\mathrm{poly}(d \cdot n)$ or $\mathrm{poly}(d/\varepsilon)$ running time) when the above assumptions on efficiency of $\mathcal{A}_p$ and
finding $h \in \mathbb{C}$ with $\mathrm{er}_Q(h) = 0$ hold.



5. The Magnitudes of Improvements 

In the previous section, we saw that we can always improve the label complexity of a passive 
learning algorithm by activizing it. However, there remains the question of how large the gap is 
between the passive algorithm's label complexity and the activized algorithm's label complexity. 
In the present section, we refine the above procedures, to take greater advantage of the sequential 
nature of active learning. For each, we characterize the improvements it achieves relative to any 
given passive algorithm. 

As a byproduct, this provides concise sufficient conditions for exponential gains, addressing
an open problem of Balcan, Hanneke, and Vaughan (2010). Specifically, consider the following
definition, essentially similar to one explored by Balcan, Hanneke, and Vaughan (2010).



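Definition 8 For a concept space $\mathbb{C}$ and distribution $\mathcal{P}$, we say that $(\mathbb{C}, \mathcal{P})$ is learnable at an
exponential rate if there exists an active learning algorithm achieving label complexity $\Lambda$ such that
$\forall f \in \mathbb{C}$, $\Lambda(\varepsilon, f, \mathcal{P}) \in \mathrm{Polylog}(1/\varepsilon)$. We further say $\mathbb{C}$ is learnable at an exponential rate if there
exists an active learning algorithm achieving label complexity $\Lambda$ such that for all distributions $\mathcal{P}$
and all $f \in \mathbb{C}$, $\Lambda(\varepsilon, f, \mathcal{P}) \in \mathrm{Polylog}(1/\varepsilon)$. ⋄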



5.1 The Label Complexity of Disagreement-Based Active Learning

As before, to establish a foundation to build upon, we begin by studying the label complexity gains 
achievable by disagreement-based active learning. From above, we already know that disagreement- 
based active learning is not sufficient to achieve the best possible gains; but as before, it will serve as 
a suitable starting place to gain intuition for how we might approach the problem of improving Meta- 
Algorithm 1 and quantifying the improvements achievable over passive learning by the resulting 
more sophisticated methods. 

The results on disagreement-based learning in this subsection are essentially already known,
and available in the published literature (though in a slightly less general form). Specifically, we
review (a modified version of) the method of Cohn, Atlas, and Ladner (1994), referred to as
Meta-Algorithm 2 below, which was historically the original disagreement-based active learning
algorithm. We then state the known results on the label complexities achievable by this method, in terms
of a quantity known as the disagreement coefficient; that result is due to Hanneke (2011, 2007b).



5.1.1 The CAL Active Learning Algorithm

To begin, we consider the following simple disagreement-based method, typically referred to as
CAL after its discoverers Cohn, Atlas, and Ladner (1994), though the version here is slightly modified
compared to the original (see below). It essentially represents a refinement of Meta-Algorithm 0
to take greater advantage of the sequential aspects of active learning. That is, rather than requesting
only two batches of labels, as in Meta-Algorithm 0, this method updates the version space after
every label request, thus focusing the region of disagreement (and therefore the region in which it
requests labels) after each label request.






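Meta-Algorithm 2

Input: passive algorithm $\mathcal{A}_p$, label budget $n$
Output: classifier $\hat{h}$

0. $V \leftarrow \mathbb{C}$, $t \leftarrow 0$, $m \leftarrow 0$, $\mathcal{L} \leftarrow \{\}$
1. While $t < \lceil n/2 \rceil$ and $m \leq 2^n$
2.     $m \leftarrow m + 1$
3.     If $X_m \in \mathrm{DIS}(V)$
4.         Request the label $Y_m$ of $X_m$ and let $t \leftarrow t + 1$
5.         Let $V \leftarrow V[(X_m, Y_m)]$
6. Let $\hat{\Delta} \leftarrow \hat{P}_m(\mathrm{DIS}(V))$
7. Do $\lfloor n/(6\hat{\Delta}) \rfloor$ times
8.     $m \leftarrow m + 1$
9.     If $X_m \in \mathrm{DIS}(V)$ and $t < n$
10.        Request the label $Y_m$ of $X_m$ and let $\hat{y} \leftarrow Y_m$ and $t \leftarrow t + 1$
11.    Else let $\hat{y} = h(X_m)$ for an arbitrary $h \in V$
12.    Let $\mathcal{L} \leftarrow \mathcal{L} \cup \{(X_m, \hat{y})\}$ and $V \leftarrow V[(X_m, \hat{y})]$
13. Return $\mathcal{A}_p(\mathcal{L})$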



The procedure is specified in terms of an estimator $\hat{P}_m$; for our purposes, we define this as in
(14) of Appendix B.1 (with $k = 1$ there). Every example $X_m$ added to the set $\mathcal{L}$ in Step 12 either has
its label requested (Step 10) or inferred (Step 11). By the same Chernoff bound argument mentioned
for the previous methods, we are guaranteed (with high probability) that the "$t < n$" constraint in
Step 9 is always satisfied when $X_m \in \mathrm{DIS}(V)$. Since we assume $f \in \mathbb{C}$, an inductive argument
shows that we will always have $f \in V$ as well; thus, every label requested or inferred will agree
with $f$, and therefore the labels in $\mathcal{L}$ are all correct.
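To make the first stage (Steps 1-5) concrete, the following is a minimal simulation sketch for threshold classifiers $h_z(x) = +1$ iff $x \geq z$ under the uniform distribution on $[0, 1]$. It is an illustration, not the paper's implementation: the function name `cal_first_stage`, the explicit representation of $V$ as an interval of thresholds, and the `max_unlabeled` cap (used here in place of the $2^n$ bound, which is impractically large to simulate) are all assumptions.

```python
import random


def cal_first_stage(target_z, n, rng, max_unlabeled=10**6):
    """Simulate the label-requesting stage of CAL for thresholds on [0, 1].

    The version space is the set of thresholds z with lo < z <= hi, so the
    region of disagreement DIS(V) is the open interval (lo, hi).
    Returns (lo, hi, t, m): final version space, labels used, points seen.
    """
    lo, hi = 0.0, 1.0
    t = m = 0
    budget = (n + 1) // 2  # ceil(n / 2) label requests in this stage
    while t < budget and m < max_unlabeled:
        m += 1
        x = rng.random()  # next unlabeled example X_m
        if lo < x < hi:  # X_m lies in DIS(V): request its label
            y = +1 if x >= target_z else -1
            t += 1
            if y == +1:
                hi = x  # thresholds above x are now inconsistent
            else:
                lo = x  # thresholds at or below x are now inconsistent
        # Points outside DIS(V) have their label implied by V and are skipped.
    return lo, hi, t, m
```

Running `cal_first_stage(0.37, 20, random.Random(0))` returns an interval that still contains the target threshold, typically far narrower than $[0, 1]$ after at most ten label requests.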

As with Meta-Algorithm 0, this method has two stages to it: one in which we focus on reducing
the version space $V$, and a second in which we focus on constructing a set of labeled examples to
feed into the passive algorithm. The original algorithm of Cohn, Atlas, and Ladner (1994) essentially
used only the first stage, and simply returned any classifier in $V$ after exhausting its budget for
label requests. Here we have added the second stage (Steps 6-13) so that we can guarantee a certain
conditional independence (given $|\mathcal{L}|$) among the examples fed into the passive algorithm, which is
important for the general results (Theorem 10 below). Hanneke (2011) showed that the original
(simpler) algorithm achieves the (less general) label complexity bound of Corollary 11 below.



5.1.2 Examples 

Not surprisingly, by essentially the same argument as Meta-Algorithm 0, one can show
Meta-Algorithm 2 satisfies the claim in Theorem 5. That is, Meta-Algorithm 2 is a universal activizer
for $\mathbb{C}$ if and only if $\mathcal{P}(\partial f) = 0$ for every $\mathcal{P}$ and $f \in \mathbb{C}$. However, there are further results known on
the label complexity achieved by Meta-Algorithm 2. Specifically, to illustrate the types of improvements
achievable by Meta-Algorithm 2, consider our usual toy examples; as before, to simplify
the explanation, for these examples we ignore the fact that $\hat{P}_m$ is only an estimate, as well as the
"$t < n$" constraint in Step 9 (both of which will be addressed in the general results below).

First, consider threshold classifiers (Example 1) under a uniform $\mathcal{P}$ on $[0, 1]$, and suppose
$f = h_z \in \mathbb{C}$. Suppose the given passive algorithm has label complexity $\Lambda_p$. To get expected error at
most $\varepsilon$ in Meta-Algorithm 2, it suffices to have $|\mathcal{L}| \geq \Lambda_p(\varepsilon/2, f, \mathcal{P})$ with probability at least $1 - \varepsilon/2$.
Starting from any particular $V$ set obtained in the algorithm, call it $V_0$, the set $\mathrm{DIS}(V_0)$ is simply the
region between the largest negative example observed so far (say $z_\ell$) and the smallest positive example
observed so far (say $z_r$). With probability at least $1 - \varepsilon/n$, at least one of the next $O(\log(n/\varepsilon))$
examples in this $[z_\ell, z_r]$ region will be in $[z_\ell + (1/3)(z_r - z_\ell), z_r - (1/3)(z_r - z_\ell)]$, so that after
processing that example, we definitely have $\mathcal{P}(\mathrm{DIS}(V)) \leq (2/3)\mathcal{P}(\mathrm{DIS}(V_0))$. Thus, upon reaching
Step 6, since we have made $n/2$ label requests, a union bound implies that with probability
$1 - \varepsilon/2$, we have $\mathcal{P}(\mathrm{DIS}(V)) \leq \exp\{-\Omega(n/\log(n/\varepsilon))\}$, and therefore $|\mathcal{L}| \geq \exp\{\Omega(n/\log(n/\varepsilon))\}$.
Thus, for some value $\Lambda_a(\varepsilon, f, \mathcal{P}) = O(\log(\Lambda_p(\varepsilon/2, f, \mathcal{P})) \log(\log(\Lambda_p(\varepsilon/2, f, \mathcal{P}))/\varepsilon))$, any
$n \geq \Lambda_a(\varepsilon, f, \mathcal{P})$ gives $|\mathcal{L}| \geq \Lambda_p(\varepsilon/2, f, \mathcal{P})$ with probability at least $1 - \varepsilon/2$, so that the activized
algorithm achieves label complexity $\Lambda_a(\varepsilon, f, \mathcal{P}) \in \mathrm{Polylog}(\Lambda_p(\varepsilon/2, f, \mathcal{P})/\varepsilon)$.

Consider also the intervals problem (Example 2) under a uniform $\mathcal{P}$ on $[0, 1]$, and suppose
$f = h_{[a,b]} \in \mathbb{C}$, for $b > a$. In this case, as with any disagreement-based algorithm, until the
algorithm observes the first positive example (i.e., the first $X_m \in [a, b]$), it will request the label
of every example (see the reasoning above for Meta-Algorithm 0). However, at every time after
observing this first positive point, say $x$, the region $\mathrm{DIS}(V)$ is restricted to the region between the
largest negative point less than $x$ and smallest positive point, and the region between the largest
positive point and the smallest negative point larger than $x$. For each of these two regions, the
same arguments used for the threshold problem above can be applied to show that, with probability
$1 - O(\varepsilon)$, the region of disagreement is reduced by at least a constant fraction every $O(\log(n/\varepsilon))$
label requests, so that $|\mathcal{L}| \geq \exp\{\Omega(n/\log(n/\varepsilon))\}$. Thus, again the label complexity is of the form
$O(\log(\Lambda_p(\varepsilon/2, f, \mathcal{P})) \log(\log(\Lambda_p(\varepsilon/2, f, \mathcal{P}))/\varepsilon))$, which is $\mathrm{Polylog}(\Lambda_p(\varepsilon/2, f, \mathcal{P})/\varepsilon)$, though this
time there is a significant (additive) target-dependent constant (roughly $\propto \frac{1}{b-a} \log(1/\varepsilon)$), accounting
for the length of the initial phase before observing any positive examples. On the other hand, as with
any disagreement-based algorithm, when $f = h_{[a,a]}$, because the algorithm never observes a positive
example, it requests the label of every example it considers; in this case, by the same argument given
for Meta-Algorithm 0, upon reaching Step 6 we have $\mathcal{P}(\mathrm{DIS}(V)) = 1$, so that $|\mathcal{L}| = O(n)$, and we
observe no improvements for some passive algorithms $\mathcal{A}_p$.

A similar analysis can be performed for unions of $i$ intervals under $\mathcal{P}$ uniform on $[0, 1]$. In
that case, we find that any $h_{\mathbf{z}} \in \mathbb{C}$ not representable (up to probability-zero differences) by a
union of $i - 1$ or fewer intervals allows for the exponential improvements of the type observed
in the previous two examples; this time, the phase of exponentially decreasing $\mathcal{P}(\mathrm{DIS}(V))$ only
occurs after observing an example in each of the $i$ intervals and each of the $i - 1$ negative regions
separating the intervals, resulting in an additive term of roughly $\propto \frac{1}{\min_{1 \leq j < 2i} z_{j+1} - z_j} \log(i/\varepsilon)$ in
the label complexity. However, any $h_{\mathbf{z}} \in \mathbb{C}$ representable (up to probability-zero differences) by
a union of $i - 1$ or fewer intervals has $\mathcal{P}(\partial h_{\mathbf{z}}) = 1$, which means $|\mathcal{L}| = O(n)$, and therefore (as
with any disagreement-based algorithm) Meta-Algorithm 2 will not provide improvements for some
passive algorithms $\mathcal{A}_p$.

5.1.3 The Disagreement Coefficient 

Toward generalizing the arguments from the above examples, consider the following definition of
Hanneke (2007b).

Definition 9 For $\varepsilon \geq 0$, the disagreement coefficient of a classifier $f$ with respect to a concept
space $\mathbb{C}$ under a distribution $\mathcal{P}$ is defined as

$$\theta_f(\varepsilon) = 1 \vee \sup_{r > \varepsilon} \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r)))}{r}.$$

Also abbreviate $\theta_f = \theta_f(0)$.



Informally, the disagreement coefficient describes the rate of collapse of the region of disagreement,
relative to the distance from $f$. It has been useful in characterizing the label complexities
achieved by several disagreement-based active learning algorithms (Hanneke, 2007b, 2011;
Dasgupta, Hsu, and Monteleoni, 2007; Beygelzimer, Dasgupta, and Langford, 2009; Wang, 2009;
Koltchinskii, 2010; Beygelzimer, Hsu, Langford, and Zhang, 2010), and itself has been studied
and bounded for various families of learning problems (Hanneke, 2007b, 2011; Balcan, Hanneke,
and Vaughan, 2010; Friedman, 2009; Beygelzimer, Dasgupta, and Langford, 2009; Mahalanabis,
2011; Wang, 2011). See the paper of Hanneke (2011) for a detailed discussion of the disagreement
coefficient, including its relationships to several related quantities, as well as a variety of properties
that it satisfies that can help to bound its value for any given learning problem. In particular, below
we use the fact that, for any constant $c \in [1, \infty)$, $\theta_f(\varepsilon) \leq \theta_f(\varepsilon/c) \leq c\,\theta_f(\varepsilon)$. Also note that
$\mathcal{P}(\partial f) = 0$ if and only if $\theta_f(\varepsilon) = o(1/\varepsilon)$. See the papers of Friedman (2009); Mahalanabis (2011)
for some general conditions on $\mathbb{C}$ and $\mathcal{P}$, under which every $f \in \mathbb{C}$ has $\theta_f < \infty$, which (as we
explain below) has particularly interesting implications for active learning (Hanneke, 2007b, 2011).

To build intuition about the behavior of the disagreement coefficient, we briefly go through its
calculation for our usual toy examples from above. The first two of these calculations are taken
from Hanneke (2007b), and the last is from Balcan, Hanneke, and Vaughan (2010). First, consider
the thresholds problem (Example 1), and for simplicity suppose the distribution $\mathcal{P}$ is uniform on
$[0, 1]$. In this case, as noted earlier, $\mathrm{B}(h_z, r) = \{h_{z'} \in \mathbb{C} : |z' - z| \leq r\}$, and $\mathrm{DIS}(\mathrm{B}(h_z, r)) \subseteq
[z - r, z + r)$ with equality for sufficiently small $r$. Therefore, $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_z, r))) \leq 2r$ (with
equality for small $r$), and $\theta_{h_z}(\varepsilon) \leq 2$ with equality for sufficiently small $\varepsilon$. In particular, $\theta_{h_z} = 2$.

On the other hand, consider the intervals problem (Example 2), again under $\mathcal{P}$ uniform on $[0, 1]$.
This time, for $h_{[a,b]} \in \mathbb{C}$ with $b - a > 0$, we have for $0 < r < b - a$, $\mathrm{B}(h_{[a,b]}, r) = \{h_{[a',b']} \in \mathbb{C} :
|a - a'| + |b - b'| \leq r\}$, $\mathrm{DIS}(\mathrm{B}(h_{[a,b]}, r)) \subseteq [a - r, a + r) \cup (b - r, b + r]$, and $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{[a,b]}, r))) \leq 4r$
(with equality for sufficiently small $r$). But for $0 < b - a \leq r$, we have $\mathrm{B}(h_{[a,b]}, r) \supseteq \{h_{[a',a']} :
a' \in (0, 1)\}$, so that $\mathrm{DIS}(\mathrm{B}(h_{[a,b]}, r)) = (0, 1)$ and $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{[a,b]}, r))) = 1$. Thus, we generally
have $\theta_{h_{[a,b]}}(\varepsilon) \leq \max\left\{\frac{1}{b-a}, 4\right\}$, with equality for sufficiently small $\varepsilon$. However, this last reasoning
also indicates $\forall r > 0$, $\mathrm{B}(h_{[a,a]}, r) \supseteq \{h_{[a',a']} : a' \in (0, 1)\}$, so that $\mathrm{DIS}(\mathrm{B}(h_{[a,a]}, r)) = (0, 1)$ and
$\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{[a,a]}, r))) = 1$; therefore, $\theta_{h_{[a,a]}}(\varepsilon) = 1/\varepsilon$, the largest possible value for the disagreement
coefficient; in particular, this also means $\theta_{h_{[a,a]}} = \infty$.
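The two calculations above can be checked numerically. The sketch below evaluates $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r)))/r$ on a grid of radii $r > \varepsilon$ under the uniform distribution on $[0, 1]$, using the closed-form regions of disagreement given in the text. The function names and the grid-based approximation of the supremum are assumptions made for illustration; the grid only approximates the supremum from below.

```python
def dis_mass_threshold(z, r):
    """Mass of [z - r, z + r) clipped to [0, 1] (thresholds example)."""
    return min(1.0, z + r) - max(0.0, z - r)


def dis_mass_interval(a, b, r):
    """Mass of DIS(B(h_[a,b], r)) for the intervals example."""
    if r >= b - a:
        return 1.0  # the ball contains every degenerate interval [a', a']
    lo_a, hi_a = max(0.0, a - r), min(1.0, a + r)
    lo_b, hi_b = max(0.0, b - r), min(1.0, b + r)
    if hi_a >= lo_b:  # the two boundary regions overlap: take their union
        return hi_b - lo_a
    return (hi_a - lo_a) + (hi_b - lo_b)


def theta(dis_mass, eps, grid=10000):
    """Approximate 1 v sup_{r > eps} dis_mass(r) / r over a grid in (eps, 1]."""
    best = 1.0
    for k in range(1, grid + 1):
        r = eps + (1.0 - eps) * k / grid
        best = max(best, dis_mass(r) / r)
    return best
```

For example, `theta(lambda r: dis_mass_threshold(0.5, r), 0.01)` is close to 2, `theta(lambda r: dis_mass_interval(0.4, 0.5, r), 0.001)` is close to $1/(b-a) = 10$, and for the degenerate interval `theta(lambda r: dis_mass_interval(0.3, 0.3, r), 0.01)` approaches $1/\varepsilon = 100$ as the grid is refined.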

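Finally, consider the unions of $i$ intervals problem (Example 3), again under $\mathcal{P}$ uniform on
$[0, 1]$. First take any $h_{\mathbf{z}} \in \mathbb{C}$ such that any $h_{\mathbf{z}'} \in \mathbb{C}$ representable as a union of $i - 1$ intervals
has $\mathcal{P}(\{x : h_{\mathbf{z}}(x) \neq h_{\mathbf{z}'}(x)\}) > 0$. Then for $0 < r < \min_{1 \leq j < 2i} z_{j+1} - z_j$,
$\mathrm{B}(h_{\mathbf{z}}, r) = \{h_{\mathbf{z}'} \in \mathbb{C} : \sum_{1 \leq j \leq 2i} |z_j - z'_j| \leq r\}$, so that $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{\mathbf{z}}, r))) \leq 4ir$, with equality
for sufficiently small $r$. For $r > \min_{1 \leq j < 2i} z_{j+1} - z_j$, $\mathrm{B}(h_{\mathbf{z}}, r)$ contains a set of classifiers that
flips the labels (compared to $h_{\mathbf{z}}$) in that smallest region and uses the resulting extra interval to
disagree with $h_{\mathbf{z}}$ on a tiny region at an arbitrary location (either by encompassing some point with a
small interval, or by splitting an interval into two intervals separated by a small gap). Thus,
$\mathrm{DIS}(\mathrm{B}(h_{\mathbf{z}}, r)) = (0, 1)$, and $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{\mathbf{z}}, r))) = 1$. So in total,
$\theta_{h_{\mathbf{z}}}(\varepsilon) \leq \max\left\{\frac{1}{\min_{1 \leq j < 2i} z_{j+1} - z_j}, 4i\right\}$, with equality for sufficiently
small $\varepsilon$. On the other hand, if $h_{\mathbf{z}} \in \mathbb{C}$ can be represented by a union of $i - 1$ (or fewer) intervals, then
we can use the extra interval to disagree with $h_{\mathbf{z}}$ on a tiny region at an arbitrary location, while still
remaining in $\mathrm{B}(h_{\mathbf{z}}, r)$, so that $\mathrm{DIS}(\mathrm{B}(h_{\mathbf{z}}, r)) = (0, 1)$, $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(h_{\mathbf{z}}, r))) = 1$, and
$\theta_{h_{\mathbf{z}}}(\varepsilon) = 1/\varepsilon$; in particular, in this case we have $\theta_{h_{\mathbf{z}}} = \infty$.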



5.1.4 General Upper Bounds on the Label Complexity of Meta-Algorithm 2 

As mentioned, the disagreement coefficient has implications for the label complexities achievable
by disagreement-based active learning. The intuitive reason for this is that, as the number of label
requests increases, the diameter of the version space shrinks at a predictable rate. The disagreement
coefficient then relates the diameter of the version space to the size of its region of disagreement,
which in turn describes the probability of requesting a label. Thus, the expected frequency of label
requests in the data sequence decreases at a predictable rate related to the disagreement coefficient,
so that $|\mathcal{L}|$ in Meta-Algorithm 2 can be lower bounded by a function of the disagreement coefficient.
Specifically, the following result was essentially established by Hanneke (2011, 2007b), though
actually the result below is slightly more general than the original.

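Theorem 10 For any VC class $\mathbb{C}$, and any passive learning algorithm $\mathcal{A}_p$ achieving label
complexity $\Lambda_p$, the active learning algorithm obtained by applying Meta-Algorithm 2 with $\mathcal{A}_p$ as input
achieves a label complexity $\Lambda_a$ that, for any distribution $\mathcal{P}$ and classifier $f \in \mathbb{C}$, satisfies

$$\Lambda_a(\varepsilon, f, \mathcal{P}) = O\left(\theta_f\left(\Lambda_p(\varepsilon/2, f, \mathcal{P})^{-1}\right) \log^2 \frac{\Lambda_p(\varepsilon/2, f, \mathcal{P})}{\varepsilon}\right).$$ ⋄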



The proof of Theorem 10 is similar to the original result of Hanneke (2011, 2007b), with only
minor modifications to account for using $\mathcal{A}_p$ instead of returning an arbitrary element of $V$. The
formal details are implicit in the proof of Theorem 16 below (since Meta-Algorithm 2 is essentially
identical to the $k = 1$ round of Meta-Algorithm 3, defined below). We also have the following
simple corollaries.

Corollary 11 For any VC class $\mathbb{C}$, there exists a passive learning algorithm $\mathcal{A}_p$ such that, for every
$f \in \mathbb{C}$ and distribution $\mathcal{P}$, the active learning algorithm obtained by applying Meta-Algorithm
2 with $\mathcal{A}_p$ as input achieves label complexity

$$\Lambda_a(\varepsilon, f, \mathcal{P}) = O\left(\theta_f(\varepsilon) \log^2(1/\varepsilon)\right).$$ ⋄

Proof The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth (1994) is a passive
learning algorithm achieving label complexity $\Lambda_p(\varepsilon, f, \mathcal{P}) \leq d/\varepsilon$. Plugging this into Theorem 10,
using the fact that $\theta_f(\varepsilon/2d) \leq 2d\,\theta_f(\varepsilon)$, and simplifying, we arrive at the result. In fact, we will see
in the proof of Theorem 16 that incurring this extra constant factor of $d$ is not actually necessary. ■
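The simplification step in this proof can be written out as follows. This is a sketch that assumes Theorem 10 has the form $\Lambda_a = O(\theta_f(\Lambda_p(\varepsilon/2, f, \mathcal{P})^{-1}) \log^2(\Lambda_p(\varepsilon/2, f, \mathcal{P})/\varepsilon))$, treats $d$ as a constant, and uses the facts that $\theta_f$ is nonincreasing and that $\theta_f(\varepsilon/c) \leq c\,\theta_f(\varepsilon)$ for $c \geq 1$.

```latex
\begin{align*}
\Lambda_p(\varepsilon/2, f, \mathcal{P}) &\leq \frac{2d}{\varepsilon}, \\
\theta_f\!\left(\Lambda_p(\varepsilon/2, f, \mathcal{P})^{-1}\right)
  &\leq \theta_f\!\left(\frac{\varepsilon}{2d}\right)
  \leq 2d\,\theta_f(\varepsilon), \\
\log^2 \frac{\Lambda_p(\varepsilon/2, f, \mathcal{P})}{\varepsilon}
  &\leq \log^2 \frac{2d}{\varepsilon^2}
  = \left(\log(2d) + 2\log(1/\varepsilon)\right)^2
  = O\!\left(\log^2(1/\varepsilon)\right), \\
\Lambda_a(\varepsilon, f, \mathcal{P})
  &= O\!\left(2d\,\theta_f(\varepsilon)\log^2(1/\varepsilon)\right)
  = O\!\left(\theta_f(\varepsilon)\log^2(1/\varepsilon)\right).
\end{align*}
```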



Corollary 12 For any VC class $\mathbb{C}$ and distribution $\mathcal{P}$, if $\forall f \in \mathbb{C}$, $\theta_f < \infty$, then $(\mathbb{C}, \mathcal{P})$ is learnable
at an exponential rate. If this is true for all $\mathcal{P}$, then $\mathbb{C}$ is learnable at an exponential rate. ⋄

Proof The first claim follows directly from Corollary 11, since $\theta_f(\varepsilon) \leq \theta_f$. The second claim then
follows from the fact that Meta-Algorithm 2 is adaptive to $\mathcal{P}$ (has no direct dependence on $\mathcal{P}$ except
via the data). ■



Aside from the disagreement coefficient and $\Lambda_p$ terms, the other constant factors hidden in the
big-O in Theorem 10 are only $\mathbb{C}$-dependent (i.e., independent of $f$ and $\mathcal{P}$). As mentioned, if we are
only interested in achieving the label complexity bound of Corollary 11, we can obtain this result
more directly by the simpler original algorithm of Cohn, Atlas, and Ladner (1994) via the analysis
of Hanneke (2011, 2007b).



5.1.5 General Lower Bounds on the Label Complexity of Meta-Algorithm 2 

It is also possible to prove a kind of lower bound on the label complexity of Meta-Algorithm 2 in
terms of the disagreement coefficient, so that the dependence on the disagreement coefficient in
Theorem 10 is unavoidable. Specifically, there are two simple observations that intuitively explain
the possibility of such lower bounds. The first observation is that the expected number
of label requests Meta-Algorithm 2 makes among the first $\lceil 1/r \rceil$ unlabeled examples is at least
$\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r)))/(2r)$ (assuming it does not halt first). Similarly, the second observation is that, to
arrive at a region of disagreement with expected probability mass less than $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r)))/2$,
Meta-Algorithm 2 requires a budget $n$ of size at least $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r)))/(2r)$. These observations
are formalized in Appendix C as Lemmas 47 and 48. Noting that, for unbounded $\theta_f(\varepsilon)$,
$\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, \varepsilon)))/\varepsilon \neq o(\theta_f(\varepsilon))$, the relevance of these observations in the context of deriving
lower bounds based on the disagreement coefficient becomes clear. In particular, we can use the latter
of these insights to arrive at the following theorem, which essentially complements Theorem 10,
showing that it cannot generally be improved beyond reducing the constants and logarithmic factors,
without altering the algorithm or introducing additional $\mathcal{A}_p$-dependent quantities in the label
complexity bound. The proof is included in Appendix C.



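Theorem 13 For any set of classifiers $\mathbb{C}$, $f \in \mathbb{C}$, distribution $\mathcal{P}$, and nonincreasing function $\lambda :
(0, 1) \to \mathbb{N}$, there exists a passive learning algorithm $\mathcal{A}_p$ achieving a label complexity $\Lambda_p$ with
$\Lambda_p(\varepsilon, f, \mathcal{P}) = \lambda(\varepsilon)$ for all $\varepsilon > 0$, such that if Meta-Algorithm 2, with $\mathcal{A}_p$ as its argument, achieves
label complexity $\Lambda_a$, then

$$\Lambda_a(\varepsilon, f, \mathcal{P}) \neq o\left(\theta_f\left(\Lambda_p(2\varepsilon, f, \mathcal{P})^{-1}\right)\right).$$ ⋄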



Recall that there are many natural learning problems for which $\theta_f = \infty$, and indeed where
$\theta_f(\varepsilon) = 1/\varepsilon$: for instance, intervals with $f = h_{[a,a]}$ under uniform $\mathcal{P}$, or unions of $i$ intervals
under uniform $\mathcal{P}$ with $f$ representable as $i - 1$ or fewer intervals. Thus, since we have just seen that
the improvements gained by disagreement-based methods are well-characterized by the disagreement
coefficient, if we would like to achieve exponential improvements over passive learning for
these problems, we will need to move beyond these disagreement-based methods. In the subsections
that follow, we will use an alternative algorithm and analysis, and prove a general result that is
always at least as good as Theorem 10 (in a big-O sense), and often significantly better (in a little-o
sense). In particular, it leads to a sufficient condition for learnability at an exponential rate, strictly
more general than that of Corollary 12.

5.2 An Improved Activizer

In this subsection, we define a new active learning method based on shattering, as in Meta-Algorithm
1, but which also takes fuller advantage of the sequential aspect of active learning, as in
Meta-Algorithm 2. We will see that this algorithm can be analyzed in a manner analogous to the
disagreement coefficient analysis of Meta-Algorithm 2, leading to a new and often dramatically improved
label complexity bound. Specifically, consider the following meta-algorithm.



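Meta-Algorithm 3

Input: passive algorithm $\mathcal{A}_p$, label budget $n$
Output: classifier $\hat{h}$

0. $V \leftarrow V_0 = \mathbb{C}$, $T_0 \leftarrow \lceil 2n/3 \rceil$, $t \leftarrow 0$, $m \leftarrow 0$
1. For $k = 1, 2, \ldots, d + 1$
2.     Let $\mathcal{L}_k \leftarrow \{\}$, $T_k \leftarrow T_{k-1} - t$, and let $t \leftarrow 0$
3.     While $t < \lceil T_k/4 \rceil$ and $m \leq k \cdot 2^n$
4.         $m \leftarrow m + 1$
5.         If $\hat{P}_m(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S) \geq 1/2$
6.             Request the label $Y_m$ of $X_m$, and let $\hat{y} \leftarrow Y_m$ and $t \leftarrow t + 1$
7.         Else let $\hat{y} \leftarrow \mathrm{argmax}_{y \in \{-1,+1\}} \hat{P}_m(S \in \mathcal{X}^{k-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S)$
8.         Let $V \leftarrow V_m = V_{m-1}[(X_m, \hat{y})]$
9.     $\hat{\Delta}^{(k)} \leftarrow \hat{P}_m(x : \hat{P}_m(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{x\} \mid V \text{ shatters } S) \geq 1/2)$
10.    Do $\lfloor T_k/(3\hat{\Delta}^{(k)}) \rfloor$ times
11.        $m \leftarrow m + 1$
12.        If $\hat{P}_m(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S) \geq 1/2$ and $t < \lfloor 3T_k/4 \rfloor$
13.            Request the label $Y_m$ of $X_m$, and let $\hat{y} \leftarrow Y_m$ and $t \leftarrow t + 1$
14.        Else, let $\hat{y} \leftarrow \mathrm{argmax}_{y \in \{-1,+1\}} \hat{P}_m(S \in \mathcal{X}^{k-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S)$
15.        Let $\mathcal{L}_k \leftarrow \mathcal{L}_k \cup \{(X_m, \hat{y})\}$ and $V \leftarrow V_m = V_{m-1}[(X_m, \hat{y})]$
16. Return ActiveSelect($\{\mathcal{A}_p(\mathcal{L}_1), \mathcal{A}_p(\mathcal{L}_2), \ldots, \mathcal{A}_p(\mathcal{L}_{d+1})\}$, $\lfloor n/3 \rfloor$, $\{X_{m+1}, X_{m+2}, \ldots\}$)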



As before, the procedure is specified in terms of estimators $\hat{P}_m$. Again, these can be defined in a
variety of ways, as long as they converge (at a fast enough rate) to their respective true probabilities.
For the results below, we will use the definitions given in Appendix B.1; i.e., the same definitions
used in Meta-Algorithm 1. Following the same argument as for Meta-Algorithm 1, one can show
that Meta-Algorithm 3 is a universal activizer for $\mathbb{C}$, for any VC class $\mathbb{C}$. However, we can also
obtain more detailed results in terms of a generalization of the disagreement coefficient given below.

As with Meta-Algorithm 1, this procedure has three main components: one in which we focus
on reducing the version space $V$, one in which we focus on collecting a (conditionally) i.i.d.
sample to feed into $\mathcal{A}_p$, and one in which we select from among the $d+1$ executions of
$\mathcal{A}_p$. However, unlike Meta-Algorithm 1, here the first stage is also broken up based on the
value of $k$, so that each $k$ has its own first and second stages, rather than sharing a single first
stage. Again, the choice of the number of (unlabeled) examples processed in each second stage
guarantees (by a Chernoff bound) that the "$t < \lfloor 3T_k/4 \rfloor$" constraint in Step 12 is
redundant. Depending on the type of label complexity result we wish to prove, this multistage
architecture is sometimes avoidable. In particular, as with Corollary 11 above, to directly achieve
the label complexity bound in Corollary 17 below, we can use a much simpler approach that replaces
Steps 9-16, instead simply returning an arbitrary element of $V$ upon termination.

Within each value of $k$, Meta-Algorithm 3 behaves analogously to Meta-Algorithm 2, requesting
the label of an example only if it cannot infer the label from known information, and updating
the version space $V$ after every label request; however, unlike Meta-Algorithm 2, for values of
$k > 1$, the mechanism for inferring a label is based on shatterable sets, as in Meta-Algorithm
1, and is motivated by the same argument of splitting $V$ into subsets containing arbitrarily good
classifiers (see the discussion in Section 4.1). Also unlike Meta-Algorithm 2, even the inferred
labels can be used to reduce the set $V$ (Steps 8 and 15), since they are not only correct but also
potentially informative in the sense that $x \in \mathrm{DIS}(V)$. As with Meta-Algorithm 1, the key to
obtaining improvement guarantees is that some value of $k$ has $|\mathcal{L}_k| \gg n$, while
maintaining that all of the labels in $\mathcal{L}_k$ are correct; ActiveSelect then guarantees the
overall performance is not too much worse than that obtained by $\mathcal{A}_p(\mathcal{L}_k)$ for this
value of $k$.

To build intuition about the behavior of Meta-Algorithm 3, let us consider our usual toy
examples, again under a uniform distribution $\mathcal{P}$ on $[0, 1]$; as before, for simplicity we
ignore the fact that $\hat{P}_m$ is only an estimate, as well as the constraint on $t$ in Step 12 and
the effectiveness of ActiveSelect, all of which will be addressed in the general analysis. First, for
the behavior of the algorithm for thresholds and nonzero-width intervals, we may simply refer to the
discussion of Meta-Algorithm 2, since the $k = 1$ round of Meta-Algorithm 3 is essentially identical
to Meta-Algorithm 2; in this case, we have already seen that $|\mathcal{L}_1|$ grows as
$\exp\{\Omega(n/\log(n/\epsilon))\}$ for thresholds, and does so for nonzero-width intervals after some
initial period of slow growth related to the width of the target interval (i.e., the period before
finding the first positive example). As with Meta-Algorithm 1, for zero-width intervals, we must look
to the $k = 2$ round of Meta-Algorithm 3 to find improvements. Also as with Meta-Algorithm 1, for
sufficiently large $n$, every $X_m$ processed in the $k = 2$ round will have its label inferred
(correctly) in Step 7 or 14 (i.e., it does not request any labels). But this means we reach Step 9
with $m = 2 \cdot 2^n + 1$; furthermore, in these circumstances the definition of $\hat{P}_m$ from
Appendix B.1 guarantees (for sufficiently large $n$) that $\hat{\Delta}^{(2)} = 2/m$, so that
$|\mathcal{L}_2| \propto n \cdot m = \Omega(n \cdot 2^n)$. Thus, we expect the label complexity gains to
be exponentially improved compared to $\mathcal{A}_p$.
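
To make the $k = 1$ behavior concrete, the following Python sketch simulates the first stage for
thresholds under the uniform distribution. It is an illustration under simplifying assumptions, not
the procedure analyzed in this paper: it replaces the estimators by exact membership in the region of
disagreement, represents the version space by the interval of consistent thresholds, and omits the
second stage and ActiveSelect. The function name and its parameters are hypothetical.

```python
import random


def first_stage_thresholds(target, label_budget, max_unlabeled, rng):
    """Simulate a k = 1 first stage for thresholds h_z(x) = +1 iff x >= z.

    Assumption: the version space is tracked exactly as the set of
    thresholds z with lo < z <= hi, instead of through estimators.
    """
    lo, hi = 0.0, 1.0
    labels_used = 0
    processed = 0
    while labels_used < label_budget and processed < max_unlabeled:
        x = rng.random()
        processed += 1
        if lo < x < hi:
            # x lies in the region of disagreement: request its label.
            y = 1 if x >= target else -1
            labels_used += 1
        else:
            # Every consistent threshold agrees on x: infer the label.
            y = 1 if x >= hi else -1
        # Update the version space with the (requested or inferred) label.
        if y == 1:
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return processed, labels_used, (lo, hi)


if __name__ == "__main__":
    processed, used, interval = first_stage_thresholds(
        0.37, 15, 200_000, random.Random(0)
    )
    print(processed, used, interval)
```

In such a run, the number of processed examples per requested label increases as the interval of
consistent thresholds shrinks, which is the effect behind the growth of $|\mathcal{L}_1|$ discussed
above.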

For a more involved example, consider unions of 2 intervals (Example 3), under uniform $\mathcal{P}$ on
$[0, 1]$, and suppose $f = h_{(a,b,a,b)}$ for $b - a > 0$; that is, the target function is
representable as a single nonzero-width interval $[a, b] \subset (0, 1)$. As we have seen,
$\partial f = (0, 1)$ in this case, so that disagreement-based methods are ineffective at improving
over passive. This also means the $k = 1$ round of Meta-Algorithm 3 will not provide improvements
(i.e., $|\mathcal{L}_1| = O(n)$). However, consider the $k = 2$ round. As discussed in Section 4.2, for
sufficiently large $n$, after the first round ($k = 1$) the set $V$ is such that any label we infer in
the $k = 2$ round will be correct. Thus, it suffices to determine how large the set $\mathcal{L}_2$
becomes. By the same reasoning as in Section 4.3, for sufficiently large $n$, the examples $X_m$ whose
labels are requested in Step 6 are precisely those not separated from both $a$ and $b$ by at least one
of the $m - 1$ examples already processed (since $V$ is consistent with the labels of all $m - 1$ of
those examples). But this is the same set of points Meta-Algorithm 2 would query for the intervals
example in Section 5.1; thus, the same argument used there implies that in this problem we have
$|\mathcal{L}_2| \geq \exp\{\Omega(n/\log(n/\epsilon))\}$ with probability $1 - \epsilon/2$, which means
we should expect a label complexity of
$O(\log(\Lambda_p(\epsilon/2, f, \mathcal{P})) \log(\log(\Lambda_p(\epsilon/2, f, \mathcal{P}))/\epsilon))$,
where $\Lambda_p$ is the label complexity of $\mathcal{A}_p$. For the case $f = h_{(a,a,a,a)}$, $k = 3$
is the relevant round, and the analysis goes similarly to the $h_{[a,a]}$ scenario for intervals above.
Unions of $i > 2$ intervals can be studied analogously, with the appropriate value of $k$ to analyze
being determined by the number of intervals required to represent the target up to probability-zero
differences (see the discussion in Section 4.2).

5.3 Beyond the Disagreement Coefficient 

In this subsection, we introduce a new quantity, a generalization of the disagreement coefficient,
which we will later use to provide a general characterization of the improvements achievable by
Meta-Algorithm 3, analogous to how the disagreement coefficient characterized the improvements
achievable by Meta-Algorithm 2 in Theorem 10. First, let us define the following generalization of
the disagreement core.

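Definition 14 For an integer $k \geq 0$, define the $k$-dimensional shatter core of a classifier $f$
with respect to a set of classifiers $\mathcal{H}$ and distribution $P$ as

$$\partial^k_{\mathcal{H},P} f = \lim_{r \to 0} \left\{ S \in \mathcal{X}^k : \mathrm{B}_{\mathcal{H},P}(f, r) \text{ shatters } S \right\}. \qquad \diamond$$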

As before, when $P = \mathcal{P}$, and $\mathcal{P}$ is clear from the context, we will abbreviate
$\partial^k_{\mathcal{H}} f = \partial^k_{\mathcal{H},\mathcal{P}} f$, and when we also intend
$\mathcal{H} = C$, the full concept space, and C is clearly defined in the given context, we further
abbreviate $\partial^k f = \partial^k_{C} f = \partial^k_{C,\mathcal{P}} f$. We have the following
definition, which will play a key role in the label complexity bounds below.

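Definition 15 For any concept space C, distribution $\mathcal{P}$, and classifier $f$,
$\forall k \in \mathbb{N}$, $\forall \epsilon \geq 0$, define

$$\theta_f^{(k)}(\epsilon) = 1 \vee \sup_{r > \epsilon} \frac{\mathcal{P}^k\left(S \in \mathcal{X}^k : \mathrm{B}(f, r) \text{ shatters } S\right)}{r}.$$

Then define

$$\tilde{d}_f = \min\left\{k \in \mathbb{N} : \mathcal{P}^k\left(\partial^k f\right) = 0\right\}$$

and

$$\tilde{\theta}_f(\epsilon) = \theta_f^{(\tilde{d}_f)}(\epsilon).$$

Also abbreviate $\theta_f^{(k)} = \theta_f^{(k)}(0)$ and $\tilde{\theta}_f = \tilde{\theta}_f(0)$. $\diamond$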

We might refer to the quantity $\theta_f^{(k)}(\epsilon)$ as the order-$k$ (or $k$-dimensional)
disagreement coefficient, as it represents a direct generalization of the disagreement coefficient
$\theta_f(\epsilon)$. However, rather than merely measuring the rate of collapse of the probability of
disagreement (one-dimensional shatterability), $\theta_f^{(k)}(\epsilon)$ measures the rate of collapse
of the probability of $k$-dimensional shatterability. In particular, we have
$\tilde{\theta}_f(\epsilon) = \theta_f^{(\tilde{d}_f)}(\epsilon) \leq \theta_f^{(1)}(\epsilon) = \theta_f(\epsilon)$,
so that this new quantity is never larger than the disagreement coefficient. However, unlike the
disagreement coefficient, we always have $\tilde{\theta}_f(\epsilon) = o(1/\epsilon)$ for VC classes C.
In fact, we could equivalently define $\tilde{\theta}_f(\epsilon)$ as the value of
$\theta_f^{(k)}(\epsilon)$ for the smallest $k$ with $\theta_f^{(k)}(\epsilon) = o(1/\epsilon)$.
Additionally, we will see below that there are many interesting cases where $\theta_f = \infty$ (even
$\theta_f(\epsilon) = \Omega(1/\epsilon)$) but $\tilde{\theta}_f < \infty$ (e.g., intervals with a
zero-width target, or unions of $i$ intervals where the target is representable as a union of $i - 1$
or fewer intervals). As was the case for $\theta_f$, we will see that showing
$\tilde{\theta}_f < \infty$ for a given learning problem has interesting implications for the label
complexity of active learning (Corollary 18 below). In the process, we have also defined the quantity
$\tilde{d}_f$, which may itself be of independent interest in the asymptotic analysis of learning in
general. For VC classes, $\tilde{d}_f$ always exists, and in fact is at most $d + 1$ (since C cannot
shatter any $d + 1$ points). When $d = \infty$, the quantity $\tilde{d}_f$ might not be defined (or
defined as $\infty$), in which case $\tilde{\theta}_f(\epsilon)$ is also not defined; in this work we
restrict our discussion to VC classes, so that this issue never comes up; Section 7 discusses possible
extensions to classes of infinite VC dimension.

We should mention that the restriction of $\tilde{\theta}_f(\epsilon) \geq 1$ in the definition is only
for convenience, as it simplifies the theorem statements and proofs below. It is not fundamental to
the definition, and can be removed (at the expense of slightly more complicated theorem statements).
In fact, this only makes a difference to the value of $\tilde{\theta}_f(\epsilon)$ in some (seemingly
unusual) degenerate cases. The same is true of $\theta_f(\epsilon)$ in Definition 9.

The process of calculating $\tilde{\theta}_f(\epsilon)$ is quite similar to that for the disagreement
coefficient; we are interested in describing $\mathrm{B}(f, r)$, and specifically the variety of
behaviors of elements of $\mathrm{B}(f, r)$ on points in $\mathcal{X}$, in this case with respect to
shattering. To illustrate the calculation of $\tilde{\theta}_f(\epsilon)$, consider our usual toy
examples, again under $\mathcal{P}$ uniform on $[0, 1]$. For the thresholds example (Example 1), we have
$\tilde{d}_f = 1$, so that $\tilde{\theta}_f(\epsilon) = \theta_f^{(1)}(\epsilon) = \theta_f(\epsilon)$,
which we have seen is equal to 2 for small $\epsilon$. Similarly, for the intervals example (Example
2), any $f = h_{[a,b]} \in C$ with $b - a > 0$ has $\tilde{d}_f = 1$, so that
$\tilde{\theta}_f(\epsilon) = \theta_f^{(1)}(\epsilon) = \theta_f(\epsilon)$, which for sufficiently
small $\epsilon$ is equal to $\max\left\{\frac{1}{b-a}, 4\right\}$. Thus, for these two examples,
$\tilde{\theta}_f(\epsilon) = \theta_f(\epsilon)$. However, continuing the intervals example, consider
$f = h_{[a,a]} \in C$. In this case, we have seen $\partial^1 f = \partial f = (0, 1)$, so that
$\mathcal{P}(\partial^1 f) = 1 > 0$. For any $x_1, x_2 \in (0, 1)$ with $0 < |x_1 - x_2| \leq r$,
$\mathrm{B}(f, r)$ can shatter $(x_1, x_2)$, specifically using the classifiers
$\{h_{[x_1,x_2]}, h_{[x_1,x_1]}, h_{[x_2,x_2]}, h_{[x_3,x_3]}\}$ for any
$x_3 \in (0, 1) \setminus \{x_1, x_2\}$. However, for any $x_1, x_2 \in (0, 1)$ with
$|x_1 - x_2| > r$, no element of $\mathrm{B}(f, r)$ classifies both as $+1$ (as it would need width
greater than $r$, and thus would have distance from $h_{[a,a]}$ greater than $r$). Therefore,
$\{S \in \mathcal{X}^2 : \mathrm{B}(f, r) \text{ shatters } S\} = \{(x_1, x_2) \in (0, 1)^2 : 0 < |x_1 - x_2| \leq r\}$;
this latter set has probability $2r(1 - r) + r^2 = (2 - r) \cdot r$, which shrinks to 0 as
$r \to 0$. Therefore, $\tilde{d}_f = 2$. Furthermore, this shows
$\tilde{\theta}_f(\epsilon) = \theta_f^{(2)}(\epsilon) = \sup_{r > \epsilon} (2 - r) = 2 - \epsilon \leq 2$.
Contrasting this with $\theta_f(\epsilon) = 1/\epsilon$, we see $\tilde{\theta}_f(\epsilon)$ is
significantly smaller than the disagreement coefficient; in particular,
$\tilde{\theta}_f = 2 < \infty$, while $\theta_f = \infty$.
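
As a sanity check on the zero-width interval calculation, the following Python sketch estimates by
simulation the probability that a uniformly drawn pair $(x_1, x_2)$ satisfies
$0 < |x_1 - x_2| \leq r$, and compares it with the closed form $(2 - r) \cdot r$. The function names
are hypothetical, and the Monte Carlo estimate is only an illustration of the calculation, not part
of the analysis.

```python
import random


def pair_is_shattered(x1, x2, r):
    """For f = h_[a,a] under the uniform distribution on [0, 1], the text
    shows B(f, r) shatters (x1, x2) exactly when 0 < |x1 - x2| <= r."""
    return 0.0 < abs(x1 - x2) <= r


def estimate_shatter_probability(r, num_pairs, rng):
    """Monte Carlo estimate of the probability of the shatterable pairs."""
    hits = 0
    for _ in range(num_pairs):
        if pair_is_shattered(rng.random(), rng.random(), r):
            hits += 1
    return hits / num_pairs


def closed_form(r):
    """The value 2r(1 - r) + r^2 = (2 - r) r derived in the text."""
    return (2.0 - r) * r


if __name__ == "__main__":
    rng = random.Random(1)
    for r in (0.05, 0.2, 0.5):
        print(r, estimate_shatter_probability(r, 200_000, rng), closed_form(r))
```

Dividing the closed form by $r$ gives $2 - r$, whose supremum over $r > \epsilon$ is the value
$2 - \epsilon$ obtained above.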

Consider also the space of unions of $i$ intervals (Example 3) under $\mathcal{P}$ uniform on
$[0, 1]$. In this case, we have already seen that, for any $f = h_z \in C$ not representable (up to
probability-zero differences) by a union of $i - 1$ or fewer intervals, we have
$\mathcal{P}(\partial^1 f) = \mathcal{P}(\partial f) = 0$, so that $\tilde{d}_f = 1$, and
$\tilde{\theta}_f = \theta_f^{(1)} = \theta_f = \max\left\{\frac{1}{\min_{1 \leq p < 2i} (z_{p+1} - z_p)}, 4i\right\}$.
To generalize this, suppose $f = h_z$ is minimally representable as a union of any number $j \leq i$
of intervals of nonzero width: $[z_1, z_2] \cup [z_3, z_4] \cup \cdots \cup [z_{2j-1}, z_{2j}]$, with
$0 < z_1 < z_2 < \cdots < z_{2j} < 1$. For our purposes, this is fully general, since every element of
C has distance zero to some $h_z$ of this type, and $\tilde{\theta}_h = \tilde{\theta}_{h'}$ for any
$h, h'$ with $\mathcal{P}(x : h(x) \neq h'(x)) = 0$. Now for any $k < i - j + 1$ and any
$S = (x_1, \ldots, x_k) \in \mathcal{X}^k$ with all elements distinct and no elements equal to any of
the $z_p$ values, the set $\mathrm{B}(f, r)$ can shatter $S$, as follows. Begin with the intervals
$[z_{2p-1}, z_{2p}]$ as above, and modify the classifier in the following way for each labeling of $S$.
For any of the $x_\ell$ values we wish to label $+1$, if it is already in an interval
$[z_{2p-1}, z_{2p}]$, we do nothing; if it is not in one of the $[z_{2p-1}, z_{2p}]$ intervals, we add
the interval $[x_\ell, x_\ell]$ to the classifier. For any of the $x_\ell$ values we wish to label
$-1$, if it is not in any interval $[z_{2p-1}, z_{2p}]$, we do nothing; if it is in some interval
$[z_{2p-1}, z_{2p}]$, we split the interval by setting to $-1$ the labels in a small region
$(x_\ell - \gamma, x_\ell + \gamma)$, for $\gamma < \min\{r/k, z_{2p} - z_{2p-1}\}$ chosen small enough
so that $(x_\ell - \gamma, x_\ell + \gamma)$ does not contain any other element of $S$. These
operations add at most $k$ new intervals to the minimal representation of the classifier as a union of
intervals, which therefore has at most $j + k \leq i$ intervals. Furthermore, the classifier disagrees
with $f$ on a set of size at most $r$, so that it is contained in $\mathrm{B}(f, r)$. We therefore have
$\mathcal{P}^k(S \in \mathcal{X}^k : \mathrm{B}(f, r) \text{ shatters } S) = 1$. However, note that for
$0 < r < \min_{1 \leq p < 2j} (z_{p+1} - z_p)$, for any $k$ and $S \in \mathcal{X}^k$ with all elements
of $S \cup \{z_p : 1 \leq p \leq 2j\}$ separated by a distance greater than $r$, classifying the points
in $S$ opposite to $f$ while remaining $r$-close to $f$ requires us to increase to a minimum of
$j + k$ intervals. Thus, for $k = i - j + 1$, any $S = (x_1, \ldots, x_k) \in \mathcal{X}^k$ with
$\min_{y_1, y_2 \in S \cup \{z_p\}_p : y_1 \neq y_2} |y_1 - y_2| > r$ is not shatterable by
$\mathrm{B}(f, r)$. We therefore have

$$\left\{S \in \mathcal{X}^k : \mathrm{B}(f, r) \text{ shatters } S\right\} \subseteq \left\{S \in \mathcal{X}^k : \min_{y_1, y_2 \in S \cup \{z_p\}_p : y_1 \neq y_2} |y_1 - y_2| \leq r\right\}.$$

For $r < \min_{1 \leq p < 2j} (z_{p+1} - z_p)$, we can bound the probability of this latter set by
considering sampling the points $x_\ell$ sequentially; the probability the $\ell$-th point is within
$r$ of one of $x_1, \ldots, x_{\ell-1}, z_1, \ldots, z_{2j}$ is at most $2r(2j + \ell - 1)$, so (by a
union bound) the probability any of the $k$ points $x_1, \ldots, x_k$ is within $r$ of any other or any
of $z_1, \ldots, z_{2j}$ is at most

$$\sum_{\ell=1}^{k} 2r(2j + \ell - 1) = 2r\left(2jk + \binom{k}{2}\right) = (1 + i - j)(i + 3j)\, r.$$

Since this approaches zero as $r \to 0$, we have $\tilde{d}_f = i - j + 1$. Furthermore, this analysis
shows
$\tilde{\theta}_f = \theta_f^{(i-j+1)} \leq \max\left\{\frac{1}{\min_{1 \leq p < 2j} (z_{p+1} - z_p)}, (1 + i - j)(i + 3j)\right\}$.
In fact, careful further inspection reveals that this upper bound is tight (i.e., this is the exact
value of $\tilde{\theta}_f$). Recalling that $\theta_f(\epsilon) = 1/\epsilon$ for $j < i$, we see that
again $\tilde{\theta}_f(\epsilon)$ is significantly smaller than the disagreement coefficient; in
particular, $\tilde{\theta}_f < \infty$ while $\theta_f = \infty$.
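
The arithmetic in the union bound above can be checked mechanically. The following Python sketch
(with a hypothetical function name) evaluates the sum $\sum_{\ell=1}^{k} 2(2j + \ell - 1)$ with
$k = i - j + 1$ and confirms that it equals both $2(2jk + \binom{k}{2})$ and $(1 + i - j)(i + 3j)$; it
verifies only this identity, not the probabilistic argument.

```python
from math import comb


def union_bound_coefficient(i, j):
    """Coefficient of r in the union bound, for k = i - j + 1 points.

    Computes the sum term by term and checks it against the grouped form
    and the closed form stated in the text.
    """
    if not 1 <= j <= i:
        raise ValueError("need 1 <= j <= i")
    k = i - j + 1
    summed = sum(2 * (2 * j + ell - 1) for ell in range(1, k + 1))
    grouped = 2 * (2 * j * k + comb(k, 2))
    closed = (1 + i - j) * (i + 3 * j)
    assert summed == grouped == closed
    return closed


if __name__ == "__main__":
    for i in range(1, 5):
        print(i, [union_bound_coefficient(i, j) for j in range(1, i + 1)])
```

For $j = i$ the coefficient reduces to $4i$, matching the constant in the disagreement coefficient
bound for targets that require all $i$ intervals.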

Of course, for the quantity $\tilde{\theta}_f(\epsilon)$ to be truly useful, we need to be able to
describe its behavior for families of learning problems beyond these simple toy problems.
Fortunately, as with the disagreement coefficient, for learning problems with simple "geometric"
interpretations, one can typically bound the value of $\tilde{\theta}_f$ without too much difficulty.
For instance, consider $\mathcal{X}$ the surface of a unit hypersphere in $p$-dimensional Euclidean
space (with $p \geq 3$), with $\mathcal{P}$ uniform on $\mathcal{X}$, and C the space of linear
separators: $C = \{h_{w,b}(x) = \mathbb{1}^{\pm}_{[0,\infty)}(w \cdot x + b) : w \in \mathbb{R}^p, b \in \mathbb{R}\}$.
Balcan, Hanneke, and Vaughan (2010) proved that $(C, \mathcal{P})$ is learnable at an exponential rate,
by a specialized argument for this space. In the process, they established that for any $f \in C$ with
$\mathcal{P}(x : f(x) = +1) \in (0, 1)$, $\theta_f < \infty$; in fact, a similar argument shows
$\theta_f \leq 4\pi\sqrt{p} / \min_y \mathcal{P}(x : f(x) = y)$. Thus, in this case, $\tilde{d}_f = 1$,
and $\tilde{\theta}_f = \theta_f < \infty$. However, consider $f \in C$ with
$\mathcal{P}(x : f(x) = y) = 1$, for some $y \in \{-1, +1\}$. In this case, every $h \in C$ with
$\mathcal{P}(x : h(x) = -y) \leq r$ has $\mathcal{P}(x : h(x) \neq f(x)) \leq r$ and is therefore
contained in $\mathrm{B}(f, r)$. In particular, for any $x \in \mathcal{X}$, there is such an $h$ that
disagrees with $f$ on only a small spherical cap containing $x$, so that
$\mathrm{DIS}(\mathrm{B}(f, r)) = \mathcal{X}$ for all $r > 0$. But this means
$\partial f = \mathcal{X}$, which implies $\theta_f(\epsilon) = 1/\epsilon$ and $\tilde{d}_f > 1$.
However, let us examine the value of $\theta_f^{(2)}$. Let
$A_p = \frac{2\pi^{p/2}}{\Gamma(p/2)}$ denote the surface area of the unit sphere in
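$\mathbb{R}^p$, and let $C_p(z) = \frac{1}{2} A_p I_{2z - z^2}\left(\frac{p-1}{2}, \frac{1}{2}\right)$
denote the surface area of a spherical cap of height $z \in (0, 1)$ (Li, 2011), where
$I_x(a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \int_0^x t^{a-1} (1 - t)^{b-1} \, dt$ is the
regularized incomplete beta function. In particular, since
$\sqrt{\frac{p}{12}} \leq \frac{\Gamma(p/2)}{\Gamma((p-1)/2)\Gamma(1/2)} \leq \frac{1}{2}\sqrt{p - 2}$,
the probability mass
$\frac{C_p(z)}{A_p} = \frac{1}{2} \frac{\Gamma(p/2)}{\Gamma((p-1)/2)\Gamma(1/2)} \int_0^{2z - z^2} t^{\frac{p-3}{2}} (1 - t)^{-\frac{1}{2}} \, dt$
contained in a spherical cap of height $z$ satisfies

$$\frac{C_p(z)}{A_p} \geq \frac{1}{2}\sqrt{\frac{p}{12}} \int_0^{2z - z^2} t^{\frac{p-3}{2}} \, dt = \sqrt{\frac{p}{12}} \, \frac{(2z - z^2)^{\frac{p-1}{2}}}{p - 1} \geq \frac{(2z - z^2)^{\frac{p-1}{2}}}{\sqrt{12p}}, \qquad (2)$$

and letting $\bar{z} = \min\{z, 1/2\}$, also satisfies

$$\frac{C_p(z)}{A_p} \leq \frac{2 C_p(\bar{z})}{A_p} \leq \frac{1}{2}\sqrt{p - 2} \int_0^{2\bar{z} - \bar{z}^2} t^{\frac{p-3}{2}} (1 - t)^{-\frac{1}{2}} \, dt \leq \sqrt{p - 2} \int_0^{2\bar{z} - \bar{z}^2} t^{\frac{p-3}{2}} \, dt = \frac{2\sqrt{p - 2}}{p - 1} (2\bar{z} - \bar{z}^2)^{\frac{p-1}{2}} \leq \frac{(2z - z^2)^{\frac{p-1}{2}}}{\sqrt{p/6}} \leq \frac{(2z)^{\frac{p-1}{2}}}{\sqrt{p/6}}. \qquad (3)$$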

Consider any linear separator $h \in \mathrm{B}(f, r)$ for $r < 1/2$, and let $z(h)$ denote the height
of the spherical cap where $h(x) = -y$. Then (2) indicates the probability mass in this region is at
least $\frac{(2z(h) - z(h)^2)^{\frac{p-1}{2}}}{\sqrt{12p}}$. Since $h \in \mathrm{B}(f, r)$, we know this
probability mass is at most $r$, and we therefore have
$2z(h) - z(h)^2 \leq \left(\sqrt{12p}\, r\right)^{\frac{2}{p-1}}$. Now for any $x_1 \in \mathcal{X}$, the
set of $x_2 \in \mathcal{X}$ for which $\mathrm{B}(f, r)$ shatters $(x_1, x_2)$ is equivalent to the set
$\mathrm{DIS}(\{h \in \mathrm{B}(f, r) : h(x_1) = -y\})$. But if $h(x_1) = -y$, then $x_1$ is in the
aforementioned spherical cap associated with $h$. A little trigonometry reveals that, for any
spherical cap of height $z(h)$, any two points on the surface of this cap are within distance
$2\sqrt{2z(h) - z(h)^2} \leq 2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ of each other. Thus, for any
point $x_2$ further than $2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ from $x_1$, it must be outside
the spherical cap associated with $h$, which means $h(x_2) = y$. But this is true for every
$h \in \mathrm{B}(f, r)$ with $h(x_1) = -y$, so that
$\mathrm{DIS}(\{h \in \mathrm{B}(f, r) : h(x_1) = -y\})$ is contained in the spherical cap of all
elements of $\mathcal{X}$ within distance $2\left(\sqrt{12p}\, r\right)^{\frac{1}{p-1}}$ of $x_1$; a
little more trigonometry reveals that the height of this spherical cap is
$2\left(\sqrt{12p}\, r\right)^{\frac{2}{p-1}}$. Then (3) indicates the probability mass in this region
is at most $\frac{2^{p-1}\sqrt{12p}\, r}{\sqrt{p/6}} = 2^p\sqrt{18}\, r$. Thus,

$$\mathcal{P}^2\left((x_1, x_2) : \mathrm{B}(f, r) \text{ shatters } (x_1, x_2)\right) = \int \mathcal{P}\left(\mathrm{DIS}(\{h \in \mathrm{B}(f, r) : h(x_1) = -y\})\right) \mathcal{P}(dx_1) \leq 2^p\sqrt{18}\, r.$$

In particular, since this approaches zero as $r \to 0$, we have $\tilde{d}_f = 2$. This also shows that
$\tilde{\theta}_f = \theta_f^{(2)} \leq 2^p\sqrt{18}$, a finite constant (albeit a rather large one).
Following similar reasoning, using the opposite inequalities as appropriate, and taking $r$
sufficiently small, one can also show $\tilde{\theta}_f \geq 2^p/(12\sqrt{2})$.
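
The constants in this calculation can be checked numerically. The following Python sketch (with
hypothetical function names) evaluates the ratio $\Gamma(p/2) / (\Gamma((p-1)/2)\Gamma(1/2))$ through
log-gamma values, tests the two-sided bound $\sqrt{p/12} \leq \cdot \leq \frac{1}{2}\sqrt{p-2}$ used
above, and confirms the simplification $2^{p-1}\sqrt{12p}/\sqrt{p/6} = 2^p\sqrt{18}$. A small relative
tolerance is an assumption added here because both bounds hold with equality at $p = 3$. The check
covers only these constants over a finite range of $p$, not the geometric argument.

```python
from math import exp, isclose, lgamma, sqrt


def gamma_ratio(p):
    """Gamma(p/2) / (Gamma((p-1)/2) * Gamma(1/2)), computed via lgamma."""
    return exp(lgamma(p / 2) - lgamma((p - 1) / 2) - lgamma(0.5))


def check_dimension(p, tol=1e-9):
    """Check the gamma-ratio bounds and the constant 2^p * sqrt(18)."""
    ratio = gamma_ratio(p)
    lower_ok = sqrt(p / 12) <= ratio * (1 + tol)
    upper_ok = ratio <= 0.5 * sqrt(p - 2) * (1 + tol)
    constant_ok = isclose(
        2 ** (p - 1) * sqrt(12 * p) / sqrt(p / 6), 2 ** p * sqrt(18)
    )
    return lower_ok and upper_ok and constant_ok


if __name__ == "__main__":
    for p in (3, 4, 5, 10, 50):
        print(p, gamma_ratio(p), check_dimension(p))
```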

5.4 Bounds on the Label Complexity of Activized Learning 

We have seen above that in the context of several examples, Meta-Algorithm 3 can offer significant
advantages in label complexity over any given passive learning algorithm, and indeed also
over disagreement-based active learning in many cases. In this subsection, we present a general
result characterizing the magnitudes of these improvements over passive learning, in terms of
$\tilde{\theta}_f(\epsilon)$. Specifically, we have the following general theorem, along with two
immediate corollaries. The proof is included in Appendix D.
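Theorem 16 For any VC class C, and any passive learning algorithm $\mathcal{A}_p$ achieving label
complexity $\Lambda_p$, the (Meta-Algorithm 3)-activized $\mathcal{A}_p$ algorithm achieves a label
complexity $\Lambda_a$ that, for any distribution $\mathcal{P}$ and classifier $f \in C$, satisfies

$$\Lambda_a(\epsilon, f, \mathcal{P}) = O\left(\tilde{\theta}_f\left(\Lambda_p(\epsilon/4, f, \mathcal{P})^{-1}\right) \log^2 \frac{\Lambda_p(\epsilon/4, f, \mathcal{P})}{\epsilon}\right). \qquad \diamond$$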



Corollary 17 For any VC class C, there exists a passive learning algorithm $\mathcal{A}_p$ such that,
for every $f \in C$ and distribution $\mathcal{P}$, the (Meta-Algorithm 3)-activized $\mathcal{A}_p$
algorithm achieves label complexity

$$\Lambda_a(\epsilon, f, \mathcal{P}) = O\left(\tilde{\theta}_f(\epsilon) \log^2(1/\epsilon)\right). \qquad \diamond$$

Proof The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth (1994) is a passive
learning algorithm achieving label complexity $\Lambda_p(\epsilon, f, \mathcal{P}) \leq d/\epsilon$.
Plugging this into Theorem 16, using the fact that
$\tilde{\theta}_f(\epsilon/(4d)) \leq 4d\, \tilde{\theta}_f(\epsilon)$, and simplifying, we arrive at the
result. In fact, in the proof of Theorem 16, we see that incurring this extra constant factor of $d$
is not actually necessary. ■



Corollary 18 For any VC class C and distribution $\mathcal{P}$, if $\forall f \in C$,
$\tilde{\theta}_f < \infty$, then $(C, \mathcal{P})$ is learnable at an exponential rate. If this is
true for all $\mathcal{P}$, then C is learnable at an exponential rate. $\diamond$

Proof The first claim follows directly from Corollary 17, since
$\tilde{\theta}_f(\epsilon) \leq \tilde{\theta}_f$. The second claim then follows from the fact that
Meta-Algorithm 3 is adaptive to $\mathcal{P}$ (has no direct dependence on $\mathcal{P}$ except via the
data). ■



Actually, in the proof we arrive at a somewhat more general result, in that the bound of Theorem 16
actually holds for any target function $f$ in the "closure" of C: that is, any $f$ such that
$\forall r > 0$, $\mathrm{B}(f, r) \neq \emptyset$. As previously mentioned, if our goal is only to
obtain the label complexity bound of Corollary 17 by a direct approach, then we can use a simpler
procedure (which cuts out Steps 9-16, instead returning an arbitrary element of $V$), analogous to how
the analysis of the original algorithm of Cohn, Atlas, and Ladner (1994) by Hanneke (2011) obtains the
label complexity bound of Corollary 11 (see also Algorithm 5 below). However, the general result of
Theorem 16 is interesting in that it applies to any passive algorithm.

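Inspecting the proof, we see that it is also possible to state a result that separates the
probability of success from the achieved error rate, similar to the PAC model of Valiant (1984) and
the analysis of active learning by Balcan, Hanneke, and Vaughan (2010). Specifically, suppose
$\mathcal{A}_p$ is a passive learning algorithm such that, $\forall \epsilon, \delta \in (0, 1)$, there
is a value $\lambda(\epsilon, \delta, f, \mathcal{P}) \in \mathbb{N}$ such that
$\forall n \geq \lambda(\epsilon, \delta, f, \mathcal{P})$,
$\mathbb{P}(\mathrm{er}(\mathcal{A}_p(\mathcal{Z}_n)) > \epsilon) \leq \delta$. Suppose $\hat{h}_n$ is the
classifier returned by the (Meta-Algorithm 3)-activized $\mathcal{A}_p$ with label budget $n$. Then for
some $(C, \mathcal{P}, f)$-dependent constant $c \in [1, \infty)$,
$\forall \epsilon, \delta \in (0, e^{-3})$, letting
$\lambda = \lambda(\epsilon/2, \delta/2, f, \mathcal{P})$,

$$\forall n \geq c\, \tilde{\theta}_f\left(\lambda^{-1}\right) \log^2(\lambda/\delta), \quad \mathbb{P}\left(\mathrm{er}(\hat{h}_n) > \epsilon\right) \leq \delta.$$

For instance, if $\mathcal{A}_p$ is an empirical risk minimization algorithm, then this is $\propto \tilde{\theta}_f(\epsilon)\,\mathrm{polylog}$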
5.5 Limitations and Potential Improvements

Theorem 16 and its corollaries represent significant improvements over most known results for
the label complexity of active learning, and in particular over Theorem 10 and its corollaries. As
for whether this also represents the best possible label complexity gains achievable by any active
learning algorithm, the answer is mixed. As with any algorithm and analysis, Meta-Algorithm
3, Theorem 16, and corollaries represent one set of solutions in a spectrum that trades strength
of performance guarantees with simplicity. As such, there are several possible modifications one
might make, which could potentially improve the performance guarantees. Here we sketch a few
such possibilities.

Even with Meta-Algorithm 3 as-is, various improvements to the bound of Theorem 16 should
be possible, simply by being more careful in the analysis. For instance, as mentioned,
Meta-Algorithm 3 is a universal activizer for any VC class C, so in particular we know that whenever
$\tilde{\theta}_f(\epsilon) \neq o\left(1/(\epsilon \log^2(1/\epsilon))\right)$, the above bound is not
tight (see the work of Balcan, Hanneke, and Vaughan, 2010, for a construction leading to such
$\tilde{\theta}_f(\epsilon)$ values), and indeed any bound of the form
$\tilde{\theta}_f(\epsilon)\,\mathrm{polylog}(1/\epsilon)$ will not be tight in that case. Again, a more
refined analysis may close this gap.

Another type of potential improvement is in the constant factors. Specifically, in the case when
$\tilde{\theta}_f < \infty$, if we are only interested in asymptotic label complexity guarantees in
Corollary 17, we can replace "$\sup_{r > 0}$" in Definition 15 with "$\limsup_{r \to 0}$," which can
sometimes be significantly smaller and/or easier to study. This is true for the disagreement
coefficient in Corollary 11 as well. Additionally, the proof (in Appendix D) reveals that there are
significant $(C, \mathcal{P}, f)$-dependent constant factors other than $\tilde{\theta}_f(\epsilon)$, and
it is quite likely that these can be improved by a more careful analysis of Meta-Algorithm 3 (or in
some cases, possibly an improved definition of the estimators $\hat{P}_m$).

However, even with such refinements to improve the results, the approach of using
$\tilde{\theta}_f$ to prove learnability at an exponential rate has limits. For instance, it is known
that any countable C is learnable at an exponential rate (Balcan, Hanneke, and Vaughan, 2010).
However, there are countable VC classes C for which $\tilde{\theta}_f = \infty$ for some elements of C
(e.g., take the tree-paths concept space of Balcan, Hanneke, and Vaughan (2010), except instead of all
infinite-depth paths from the root, take all of the finite-depth paths from the root, but keep one
infinite-depth path $f$; for this modified space C, which is countable, every $h \in C$ has
$\tilde{d}_h = 1$, and for that one infinite-depth $f$ we have $\tilde{\theta}_f = \infty$).

Inspecting the proof reveals that it is possible to make the results slightly sharper by replacing
$\tilde{\theta}_f(r_0)$ (for $r_0$ as in the results above) with a somewhat more complicated quantity:
namely,

$$\min_{k < \tilde{d}_f} \sup_{r > r_0} r^{-1} \cdot \mathcal{P}\left(x \in \mathcal{X} : \mathcal{P}^k\left(S \in \mathcal{X}^k : \mathrm{B}(f, r) \text{ shatters } S \cup \{x\}\right) \geq \mathcal{P}^k\left(\partial^k f\right)/16\right). \qquad (4)$$

This quantity can be bounded in terms of $\tilde{\theta}_f(r_0)$ via Markov's inequality, but is
sometimes smaller.

As for improving Meta-Algorithm 3 itself, there are several possibilities. One immediate
improvement one can make is to replace the condition in Steps 5 and 12 by
$\min_{1 \leq j \leq k} \hat{P}_m(S \in \mathcal{X}^{j-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S) \geq 1/2$,
likewise replacing the corresponding quantity in Step 9, and substituting in Steps 7 and 14 the
quantity
$\max_{1 \leq j \leq k} \hat{P}_m(S \in \mathcal{X}^{j-1} : V[(X_m, -y)] \text{ does not shatter } S \mid V \text{ shatters } S)$;
in particular, the results stated for Meta-Algorithm 3 remain valid with this substitution, requiring
only minor modifications to the proofs. However, it is not clear what gains in theoretical guarantees
this achieves.

Additionally, there are various quantities in this procedure that can be altered almost arbitrarily,
allowing room for fine-tuning. Specifically, the 2/3 in Step 0 and 1/3 in Step 16 can be set to
arbitrary constants summing to 1. Likewise, the 1/4 in Step 3, 1/3 in Step 10, and 3/4 in Step 12
can be changed to any constants in (0, 1), possibly depending on $k$, such that the sum of the first
two is strictly less than the third. Also, the 1/2 in Steps 5, 9, and 12 can be set to any constant in
(0, 1). Furthermore, the $k \cdot 2^n$ in Step 3 only prevents infinite looping, and can be set to any
function growing superlinearly in $n$, though to get the largest possible improvements it should at
least grow exponentially in $n$; typically, any active learning algorithm capable of exponential
improvements over reasonable passive learning algorithms will require access to a number of unlabeled
examples exponential in $n$, and Meta-Algorithm 3 is no exception to this.
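
The constraints on these constants are easy to misstate when tuning, so the following Python sketch
encodes them as a check. The function and parameter names are hypothetical, exact rational arithmetic
is used to avoid rounding issues in the "summing to 1" test, and the requirement that the two budget
fractions be positive is an assumption made here rather than a condition stated above.

```python
from fractions import Fraction


def constants_are_admissible(step0, step16, step3, step10, step12, threshold):
    """Check the stated constraints on the constants of Meta-Algorithm 3.

    step0, step16: fractions of the label budget (2/3 and 1/3 by default).
    step3, step10, step12: the 1/4, 1/3 and 3/4 constants.
    threshold: the 1/2 constant in Steps 5, 9 and 12.
    """
    # Assumption: both budget fractions are positive, and they sum to 1.
    budget_split_ok = step0 > 0 and step16 > 0 and step0 + step16 == 1
    in_unit_interval = all(
        0 < c < 1 for c in (step3, step10, step12, threshold)
    )
    # The first two stage constants must sum to strictly less than the third.
    stage_split_ok = step3 + step10 < step12
    return budget_split_ok and in_unit_interval and stage_split_ok


if __name__ == "__main__":
    defaults = (
        Fraction(2, 3), Fraction(1, 3),
        Fraction(1, 4), Fraction(1, 3), Fraction(3, 4),
        Fraction(1, 2),
    )
    print(constants_are_admissible(*defaults))
```

With the default values, $1/4 + 1/3 = 7/12 < 3/4$, so the check passes.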

One major issue in the design of the procedure is an inherent trade-off between the achieved
label complexity and the number of unlabeled examples used by the algorithm. This is noteworthy
both because of the practical concerns of gathering such large quantities of unlabeled data, and
also for computational efficiency reasons. In contrast to disagreement-based methods, the design
of the estimators used in Meta-Algorithm 3 introduces such a trade-off, though in contrast to the
splitting index analysis of Dasgupta (2005), the trade-off here seems only in the constant factors.
The choice of these $\hat{P}_m$ estimators, both in their definition in Appendix B.1, and indeed in the
very quantities they estimate, is such that we can (if desired) limit the number of unlabeled
examples the main body of the algorithm uses (the actual number it needs to achieve Theorem 16 can
be extracted from the proofs in Appendix D.1). However, if the number of unlabeled examples
used by the algorithm is not a limiting factor, we can suggest more effective quantities.
Specifically, following the original motivation for using shatterable sets, we might consider a
greedily-constructed distribution over the set
$\{S \in \mathcal{X}^j : V \text{ shatters } S, 1 \leq j < k, \text{ and either } j = k - 1 \text{ or } \mathcal{P}(s : V \text{ shatters } S \cup \{s\}) = 0\}$.
We can construct the distribution implicitly, via the following generative model. First we set
$S = \{\}$. Then repeat the following. If $|S| = k - 1$ or
$\mathcal{P}(s \in \mathcal{X} : V \text{ shatters } S \cup \{s\}) = 0$, output $S$; otherwise, sample $s$
according to the conditional distribution of $X$ given that $V$ shatters $S \cup \{X\}$. If we denote
this distribution (over $S$) as $\mathcal{P}_k$, then replacing the estimator
$\hat{P}_m(S \in \mathcal{X}^{k-1} : V \text{ shatters } S \cup \{X_m\} \mid V \text{ shatters } S)$ in
Meta-Algorithm 3 with an appropriately constructed estimator of
$\mathcal{P}_k(S : V \text{ shatters } S \cup \{X_m\})$ (and similarly replacing the other estimators)
can lead to some improvements in the constant factors of the label complexity. However, such a
modification can also dramatically increase the number of unlabeled examples required by the
algorithm, since determining whether
$\mathcal{P}(s \in \mathcal{X} : V \text{ shatters } S \cup \{s\}) \approx 0$ can be costly.
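
The generative model just described can be sketched in Python for a finite set of classifiers. This
is only an illustration under assumptions that are not part of the text: the version space is given
as an explicit finite list of functions, the conditional distribution is sampled by rejection, and
the test for a zero-probability extension is approximated by giving up after a fixed number of
rejected draws (`max_tries`). All function and parameter names are hypothetical.

```python
import random


def shatters(hypotheses, points):
    """True if the hypotheses realize all 2^|points| labelings of points."""
    patterns = {tuple(h(x) for x in points) for h in hypotheses}
    return len(patterns) == 2 ** len(points)


def sample_shatterable_set(hypotheses, k, draw_x, rng, max_tries=1000):
    """Greedily grow a shatterable set S with at most k - 1 points.

    Each new point is drawn by rejection sampling from the conditional
    distribution of X given that the hypotheses shatter S plus the new
    point. Assumption: max_tries consecutive rejections are treated as
    the extension event having probability zero.
    """
    S = []
    while len(S) < k - 1:
        for _ in range(max_tries):
            s = draw_x(rng)
            if shatters(hypotheses, S + [s]):
                S.append(s)
                break
        else:
            break  # no shatterable extension found: output S as is
    return S


if __name__ == "__main__":
    # Toy version space: thresholds h_z(x) = +1 iff x >= z, z on a grid.
    thresholds = [lambda x, z=z / 10: (1 if x >= z else -1) for z in range(1, 10)]
    rng = random.Random(0)
    print(sample_shatterable_set(thresholds, 3, lambda g: g.random(), rng))
```

The `max_tries` cutoff makes explicit the cost noted above: deciding that no shatterable extension
exists requires many unlabeled draws.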

Unlike Meta-Algorithm 1, there remain serious efficiency concerns surrounding Meta-Algorithm
3. If we knew the value of $\tilde{d}_f$ and $\tilde{d}_f \leq c \log_2(d)$ for some constant $c$, then
we could potentially design an efficient version of Meta-Algorithm 3 still achieving Corollary 17.
Specifically, suppose we can find a classifier in C consistent with any given sample, or determine
that no such classifier exists, in time polynomial in the sample size (and $d$), and also that
$\mathcal{A}_p$ efficiently returns a classifier in C consistent with the sample it is given. Then
replacing the loop of Step 1 by simply running with $k = \tilde{d}_f$ and returning
$\mathcal{A}_p(\mathcal{L}_{\tilde{d}_f})$, the algorithm becomes efficient, in the sense that with high
probability, its running time is $\mathrm{poly}(d/\epsilon)$, where $\epsilon$ is the error rate
guarantee from inverting the label complexity at the value of $n$ given to the algorithm. To be clear,
in some cases we may obtain values $m \propto \exp\{\Omega(n)\}$, but the error rate guaranteed by
$\mathcal{A}_p$ is $O(1/m)$ in these cases, so that we still have $m$ polynomial in $d/\epsilon$.
However, in the absence of this access to $\tilde{d}_f$, the values of $k > \tilde{d}_f$ in
Meta-Algorithm 3 may reach values of $m$ much larger than $\mathrm{poly}(d/\epsilon)$, since the error
rates obtained from these $\mathcal{A}_p(\mathcal{L}_k)$ evaluations are not guaranteed to be better
than the $\mathcal{A}_p(\mathcal{L}_{\tilde{d}_f})$ evaluations, and yet we may have
$|\mathcal{L}_k| \gg |\mathcal{L}_{\tilde{d}_f}|$. Thus, there remains a challenging problem of obtaining
the results above (Theorem 16 and Corollary 17) via an efficient algorithm, adaptive to the value of
$\tilde{d}_f$.

6. Toward Agnostic Activized Learning 

The previous sections addressed learning in the realizable case, where there is a perfect classifier
$f \in C$ (i.e., $\mathrm{er}(f) = 0$). To move beyond these scenarios, to problems in which $f$ is not
a perfect classifier (i.e., stochastic labels) or not well-approximated by C, requires a change in
technique to make the algorithms more robust to such issues. As we will see in Subsection 6.2, the
results we can prove in this more general setting are not quite as strong as those of the previous
sections, but in some ways they are more interesting, both from a practical perspective, as we expect
real learning problems to involve imperfect teachers or underspecified instance representations, and
also from a theoretical perspective, as the class of problems addressed is significantly more general
than those encompassed by the realizable case above.

In this context, we will be largely interested in more general versions of the same types of
questions as above, such as whether one can activize a given passive learning algorithm, in this
case guaranteeing strictly improved label complexities for all nontrivial joint distributions over
$\mathcal{X} \times \{-1, +1\}$. In Subsection 6.3, we present a general conjecture regarding this type
of strong domination. At the same time, to approach such questions, we will also need to focus on
developing techniques to make the algorithms robust to label noise. For this, we will use a natural
generalization of techniques developed for noise-robust disagreement-based active learning, analogous
to how we generalized Meta-Algorithm 2 to arrive at Meta-Algorithm 3 above. For this purpose, as well
as for the sake of comparison, we will review the known techniques and results for disagreement-based
agnostic active learning in Subsection 6.5. We then extend these techniques in Subsection 6.6 to
develop a new type of agnostic active learning algorithm, based on shatterable sets, which relates to
the disagreement-based agnostic active learning algorithms in a way analogous to how Meta-Algorithm
3 relates to Meta-Algorithm 2. Furthermore, we present a bound on the label complexities achieved
by this method, representing a natural generalization of both Corollary 17 and the known results on
disagreement-based agnostic active learning (Hanneke, 2011).



Although we present several new results, in some sense this section is less about what we know 
and more about what we do not yet know. As such, we will focus less on presenting a complete 
and elegant theory, and more on identifying potentially promising directions for exploration. In 
particular, Subsection 6.8 sketches out some interesting directions, which could potentially lead to
a resolution of the aforementioned general conjecture from Subsection 6.3.

6.1 Definitions and Notation 

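In this setting, there is a joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$, with marginal distribution $\mathcal{P}$
on $\mathcal{X}$. For any classifier $h$, we denote by $\operatorname{er}(h) = \mathcal{P}_{XY}((x,y) : h(x) \neq y)$. Also, denote by
$\nu^*(\mathcal{P}_{XY}) = \inf_{h} \operatorname{er}(h)$ (the infimum ranging over all measurable classifiers $h : \mathcal{X} \to \{-1,+1\}$)
the Bayes error rate, or simply $\nu^*$ when $\mathcal{P}_{XY}$ is clear from the
context; also define the conditional label distribution $\eta(x;\mathcal{P}_{XY}) = \mathbb{P}(Y = +1 \mid X = x)$, where
$(X,Y) \sim \mathcal{P}_{XY}$, or $\eta(x) = \eta(x;\mathcal{P}_{XY})$ when $\mathcal{P}_{XY}$ is clear from the context. For a given concept
space $\mathbb{C}$, denote $\nu(\mathbb{C};\mathcal{P}_{XY}) = \inf_{h \in \mathbb{C}} \operatorname{er}(h)$, called the noise rate of $\mathbb{C}$; when $\mathbb{C}$ and/or $\mathcal{P}_{XY}$ is clear
from the context, we may abbreviate $\nu = \nu(\mathbb{C}) = \nu(\mathbb{C};\mathcal{P}_{XY})$. For $\mathcal{H} \subseteq \mathbb{C}$, the diameter is defined
as $\operatorname{diam}(\mathcal{H};\mathcal{P}) = \sup_{h_1,h_2 \in \mathcal{H}} \mathcal{P}(x : h_1(x) \neq h_2(x))$. Also, for any $\varepsilon > 0$, define the $\varepsilon$-minimal set
$\mathbb{C}(\varepsilon;\mathcal{P}_{XY}) = \{h \in \mathbb{C} : \operatorname{er}(h) \leq \nu + \varepsilon\}$. For any set of classifiers $\mathcal{H}$, define the closure, denoted
$\operatorname{cl}(\mathcal{H};\mathcal{P})$, as the set of all measurable $h : \mathcal{X} \to \{-1,+1\}$ such that $\forall r > 0$, $\mathrm{B}_{\mathcal{H},\mathcal{P}}(h,r) \neq \emptyset$. When
$\mathcal{P}_{XY}$ is clear from the context, we will simply refer to $\mathbb{C}(\varepsilon) = \mathbb{C}(\varepsilon;\mathcal{P}_{XY})$, and when $\mathcal{P}$ is clear, we
write $\operatorname{diam}(\mathcal{H}) = \operatorname{diam}(\mathcal{H};\mathcal{P})$ and $\operatorname{cl}(\mathcal{H}) = \operatorname{cl}(\mathcal{H};\mathcal{P})$.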

In the noisy setting, rather than being a perfect classifier, we will let $f$ denote an arbitrary
element of $\operatorname{cl}(\mathbb{C};\mathcal{P})$ with $\operatorname{er}(f) = \nu(\mathbb{C};\mathcal{P}_{XY})$: that is, $f \in \bigcap_{\varepsilon > 0} \operatorname{cl}(\mathbb{C}(\varepsilon;\mathcal{P}_{XY});\mathcal{P})$. Such a classifier
must exist, since $\operatorname{cl}(\mathbb{C})$ is compact in the pseudo-metric $\rho(h,g) = \int |h - g| \, d\mathcal{P} \propto \mathcal{P}(x : h(x) \neq g(x))$
(in the usual sense of the equivalence classes being compact in the $\rho$-induced metric). This can
be seen by recalling that $\mathbb{C}$ is totally bounded (Haussler, 1992), and thus so is $\operatorname{cl}(\mathbb{C})$, and that $\operatorname{cl}(\mathbb{C})$
is a closed subset of $L^1(\mathcal{P})$, which is complete (Dudley, 2002), so $\operatorname{cl}(\mathbb{C})$ is also complete (Munkres,
2000). Total boundedness and completeness together imply compactness (Munkres, 2000), and this
implies the existence of $f$ since monotone sequences of nonempty closed subsets of a compact space
have a nonempty limit set (Munkres, 2000).

As before, in the learning problem there is a sequence $\mathcal{Z} = \{(X_1,Y_1), (X_2,Y_2), \ldots\}$, where
the $(X_i,Y_i)$ are independent and identically distributed, and we denote by $\mathcal{Z}_m = \{(X_i,Y_i)\}_{i=1}^{m}$. As
before, the $X_i \sim \mathcal{P}$, but rather than having each $Y_i$ value determined as a function of $X_i$, instead
we have each pair $(X_i,Y_i) \sim \mathcal{P}_{XY}$. The learning protocol is defined identically as above; that
is, the algorithm has direct access to the $X_i$ values, but must request the $Y_i$ (label) values one at a
time, sequentially, and can request at most $n$ total labels, where $n$ is a budget provided as input to
the algorithm. The label complexity is now defined just as before (Definition 1), but generalized
by replacing $(f,\mathcal{P})$ with the joint distribution $\mathcal{P}_{XY}$. Specifically, we have the following formal
definition, which will be used throughout this section (and the corresponding appendices).

Definition 19 An active learning algorithm $\mathcal{A}$ achieves label complexity $\Lambda(\cdot,\cdot)$ if, for any joint
distribution $\mathcal{P}_{XY}$, for any $\varepsilon \in (0,1)$ and any integer $n \geq \Lambda(\varepsilon,\mathcal{P}_{XY})$, we have $\mathbb{E}[\operatorname{er}(\mathcal{A}(n))] \leq \varepsilon$. $\diamond$

However, because there may not be any classifier with error rate less than any arbitrary $\varepsilon \in (0,1)$,
our objective changes here to achieving error rate at most $\nu + \varepsilon$ for any given $\varepsilon \in (0,1)$. Thus, we
are interested in the quantity $\Lambda(\nu + \varepsilon,\mathcal{P}_{XY})$, and will be particularly interested in this quantity's
asymptotic dependence on $\varepsilon$, as $\varepsilon \to 0$. In particular, $\Lambda(\varepsilon,\mathcal{P}_{XY})$ may often be infinite for $\varepsilon < \nu$.

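The label complexity for passive learning can be generalized analogously, again replacing $(f,\mathcal{P})$
by $\mathcal{P}_{XY}$ in Definition 2 as follows.

Definition 20 A passive learning algorithm $\mathcal{A}$ achieves label complexity $\Lambda(\cdot,\cdot)$ if, for any joint
distribution $\mathcal{P}_{XY}$, for any $\varepsilon \in (0,1)$ and any integer $n \geq \Lambda(\varepsilon,\mathcal{P}_{XY})$, we have $\mathbb{E}[\operatorname{er}(\mathcal{A}(\mathcal{Z}_n))] \leq \varepsilon$. $\diamond$

For any label complexity $\Lambda$ in the agnostic case, define the set $\operatorname{Nontrivial}(\Lambda;\mathbb{C})$ as the set
of all distributions $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ such that $\forall \varepsilon > 0$, $\Lambda(\nu + \varepsilon,\mathcal{P}_{XY}) < \infty$, and $\forall g \in
\operatorname{Polylog}(1/\varepsilon)$, $\Lambda(\nu + \varepsilon,\mathcal{P}_{XY}) = \omega(g(\varepsilon))$. In this context, we can define an activizer for a given
passive algorithm as follows.

Definition 21 We say an active meta-algorithm $\mathcal{A}_a$ activizes a passive algorithm $\mathcal{A}_p$ for $\mathbb{C}$ in
the agnostic case if the following holds. For any label complexity $\Lambda_p$ achieved by $\mathcal{A}_p$, the
active learning algorithm $\mathcal{A}_a(\mathcal{A}_p,\cdot)$ achieves a label complexity $\Lambda_a$ such that, for every distribution
$\mathcal{P}_{XY} \in \operatorname{Nontrivial}(\Lambda_p;\mathbb{C})$, there exists a constant $c \in [1,\infty)$ such that

$$\Lambda_a(\nu + c\varepsilon,\mathcal{P}_{XY}) = o(\Lambda_p(\nu + \varepsilon,\mathcal{P}_{XY})).$$

In this case, $\mathcal{A}_a$ is called an activizer for $\mathcal{A}_p$ with respect to $\mathbb{C}$ in the agnostic case, and the active
learning algorithm $\mathcal{A}_a(\mathcal{A}_p,\cdot)$ is called the $\mathcal{A}_a$-activized $\mathcal{A}_p$. $\diamond$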

6.2 A Negative Result 

First, the bad news: we cannot generally hope for universal activizers for VC classes in the agnostic 
case. In fact, there even exist passive algorithms that cannot be activized, even by any specialized 
active learning algorithm. 

Specifically, consider again Example 1, where $\mathcal{X} = [0,1]$ and $\mathbb{C}$ is the class of threshold
classifiers, and let $\mathcal{A}_p$ be a passive learning algorithm that behaves as follows. Given $n$ points
$\mathcal{Z}_n = \{(X_1,Y_1), (X_2,Y_2), \ldots, (X_n,Y_n)\}$, $\mathcal{A}_p(\mathcal{Z}_n)$ returns the classifier $h_z \in \mathbb{C}$, where $z$ =

and VO = ( V |) A |, taking % = 1/8 if G {1, . . . , n} : X, = 0} = 0. 
For most distributions $\mathcal{P}_{XY}$, this algorithm clearly would not behave "reasonably," in that its error
rate would be quite large; in particular, in the realizable case, the algorithm's worst-case expected
error rate does not converge to zero as $n \to \infty$. However, for certain distributions $\mathcal{P}_{XY}$ engineered
specifically for this algorithm, it has near-optimal behavior in a strong sense. Specifically, we have
the following result, the proof of which is included in Appendix E.1.

Theorem 22 There is no activizer for $\mathcal{A}_p$ with respect to the space of threshold classifiers in the
agnostic case. $\diamond$

Recall that threshold classifiers were, in some sense, one of the simplest scenarios for activized 
learning in the realizable case. Also, since threshold-like problems are embedded in most "geo- 
metric" concept spaces, this indicates we should generally not expect there to exist activizers for 
arbitrary passive algorithms in the agnostic case. However, this leaves open the question of whether
certain families of passive learning algorithms can be activized in the agnostic case, a topic we turn 
to next. 

6.3 A Conjecture: Activized Empirical Risk Minimization 

The counterexample above is interesting, in that it exposes the limits on generality in the agnostic 
setting. However, the passive algorithm that cannot be activized there is in many ways not very rea- 
sonable, in that it has suboptimal worst-case expected excess error rate (among other deficiencies). 
It may therefore be more interesting to ask whether some family of "reasonable" passive learning 
algorithms can be activized in the agnostic case. It seems that, unlike $\mathcal{A}_p$ above, certain passive
learning algorithms should not have too peculiar a dependence on the label noise, so that they use
$Y_i$ to help determine $f(X_i)$ and that is all. In such cases, any $Y_i$ value for which we can already
infer the value $f(X_i)$ should simply be ignored as redundant information, so that we needn't request
such values. While this discussion is admittedly vague, consider the following formal conjecture.
Recall that an empirical risk minimization algorithm for $\mathbb{C}$ is a type of passive learning algorithm
$\mathcal{A}$, characterized by the fact that for any set $\mathcal{L} \in \bigcup_m (\mathcal{X} \times \{-1,+1\})^m$, $\mathcal{A}(\mathcal{L}) \in \operatorname{argmin}_{h \in \mathbb{C}} \operatorname{er}_{\mathcal{L}}(h)$.
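As a concrete illustration of the empirical risk minimization property just recalled, the following is a minimal sketch for threshold classifiers on $[0,1]$. The names `threshold`, `empirical_error`, and `erm_threshold`, the convention that a threshold classifier predicts $+1$ exactly when $x \geq z$, and the restriction of the search to thresholds at sample points are assumptions made here for illustration; they are not taken from the original text.

```python
from typing import List, Tuple

Sample = List[Tuple[float, int]]  # pairs (x, y) with y in {-1, +1}


def threshold(z: float):
    """Hypothetical threshold classifier: +1 if x >= z, else -1."""
    return lambda x: 1 if x >= z else -1


def empirical_error(h, sample: Sample) -> float:
    """Fraction of labeled examples in `sample` that h misclassifies."""
    if not sample:
        return 0.0
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def erm_threshold(sample: Sample) -> float:
    """Return a threshold z minimizing the empirical error on `sample`.

    Only thresholds at sample points (plus one above every point) are
    checked, since the empirical error changes only at those values.
    """
    candidates = sorted({x for x, _ in sample}) + [float("inf")]
    return min(candidates, key=lambda z: empirical_error(threshold(z), sample))
```

On a sample with label noise, the returned threshold generally has nonzero empirical error; the sketch simply picks one minimizer, and any other minimizer would equally satisfy the definition.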

Conjecture 23 For any VC class $\mathbb{C}$, there exists an active meta-algorithm $\mathcal{A}_a$ and an empirical risk
minimization algorithm $\mathcal{A}_p$ for $\mathbb{C}$ such that $\mathcal{A}_a$ activizes $\mathcal{A}_p$ for $\mathbb{C}$ in the agnostic case. $\diamond$

Resolution of this conjecture would be interesting for a variety of reasons. If the conjecture 
is correct, it means that the vast (and growing) literature on the label complexity of empirical risk
minimization has direct implications for the potential performance of active learning under the same 
conditions. We might also expect activized empirical risk minimization to be quite effective in 
practical applications. 

While this conjecture remains open at this time, the remainder of this section might be viewed 
as partial evidence in its favor, as we show that active learning is able to achieve improvements over 
the known bounds on the label complexity of passive learning in many cases. 



6.4 Low Noise Conditions 

In the subsections below, we will be interested in stating bounds on the label complexity of active
learning, analogous to those of Theorem 10 and Theorem 16, but for learning with label noise.
As in the realizable case, we should expect such bounds to have some explicit dependence on
the distribution $\mathcal{P}_{XY}$. Initially, one might hope that we could state interesting label complexity
bounds purely in terms of a simple quantity such as $\nu(\mathbb{C};\mathcal{P}_{XY})$. However, it is known that any
label complexity bound for a nontrivial $\mathbb{C}$ (for either passive or active) depending on $\mathcal{P}_{XY}$ only via
$\nu(\mathbb{C};\mathcal{P}_{XY})$ will be $\Omega(\varepsilon^{-2})$ when $\nu(\mathbb{C};\mathcal{P}_{XY}) > 0$ (Kääriäinen, 2006). Since passive learning can
achieve a $\mathcal{P}_{XY}$-independent $O(\varepsilon^{-2})$ label complexity bound for any VC class (Alexander, 1984),
we will need to discuss label complexity bounds that depend on $\mathcal{P}_{XY}$ via more detailed quantities
than merely $\nu(\mathbb{C};\mathcal{P}_{XY})$ if we are to characterize the improvements of active learning over passive.

In this subsection, we review an index commonly used to describe certain properties of $\mathcal{P}_{XY}$
relative to $\mathbb{C}$: namely, the Mammen-Tsybakov margin conditions (Mammen and Tsybakov, 1999;
Tsybakov, 2004; Koltchinskii, 2006). Specifically, we have the following formal condition from
Koltchinskii (2006).



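Condition 1 There exist constants $\mu, \kappa \in [1,\infty)$ such that $\forall \varepsilon > 0$, $\operatorname{diam}(\mathbb{C}(\varepsilon;\mathcal{P}_{XY});\mathcal{P}) \leq \mu \cdot \varepsilon^{\frac{1}{\kappa}}$. $\diamond$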
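To see what this kind of diameter condition measures, here is a small numerical sketch on a toy problem: threshold classifiers over a uniform grid on $[0,1]$, with labels that are flipped with probability $1/10$ on either side of the point $1/2$ (a bounded noise setting). Everything in it is an assumption made for illustration: the grid size `N`, the noise levels, the function names, and the convention that a threshold classifier predicts $+1$ exactly when $x \geq z$. In this toy problem the diameter of the $\varepsilon$-minimal set grows linearly in $\varepsilon$, which corresponds to the exponent $1/\kappa$ with $\kappa = 1$.

```python
from fractions import Fraction

N = 100  # grid size (an arbitrary choice for this toy example)
XS = [Fraction(2 * j + 1, 2 * N) for j in range(N)]  # points, each of mass 1/N
ZS = [Fraction(j, N) for j in range(N + 1)]          # candidate thresholds
ETA_LOW, ETA_HIGH = Fraction(1, 10), Fraction(9, 10)  # P(Y=+1 | x) below/above 1/2


def eta(x):
    """Conditional probability of label +1 at x in the toy distribution."""
    return ETA_HIGH if x >= Fraction(1, 2) else ETA_LOW


def er(z):
    """Exact error rate of the threshold classifier with threshold z."""
    total = Fraction(0)
    for x in XS:
        p_plus = eta(x)
        # Predicting +1 errs with probability 1 - eta(x); predicting -1 with eta(x).
        total += (1 - p_plus) if x >= z else p_plus
    return total / N


def diam_eps_minimal(eps):
    """Diameter of the eps-minimal set {z : er(z) <= nu + eps} under the marginal."""
    nu = min(er(z) for z in ZS)
    good = [z for z in ZS if er(z) <= nu + eps]
    # Thresholds z1 <= z2 disagree exactly on grid points in [z1, z2), so the
    # diameter is attained by the two extreme thresholds in the set.
    lo, hi = min(good), max(good)
    return Fraction(sum(1 for x in XS if lo <= x < hi), N)
```

For example, `diam_eps_minimal(Fraction(4, 100))` evaluates to `Fraction(1, 10)`, which equals $\frac{5}{2}\varepsilon$, so in this toy setting the bound holds with the (assumed) constants $\mu = 5/2$ and $\kappa = 1$.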



This condition has recently been studied in depth in the passive learning literature, as it can be
used to characterize scenarios where the label complexity of passive learning is between the worst-case
$\Theta(1/\varepsilon^2)$ and the realizable case $\Theta(1/\varepsilon)$ (e.g., Mammen and Tsybakov, 1999; Tsybakov, 2004;
Koltchinskii, 2006; Massart and Nédélec, 2006). The condition is implied by a variety of interesting
special cases. For instance, it is satisfied when

$$\exists \mu', \kappa \in [1,\infty) \text{ s.t. } \forall h \in \mathbb{C},\ \operatorname{er}(h) - \nu(\mathbb{C};\mathcal{P}_{XY}) \geq \mu' \cdot \mathcal{P}(x : h(x) \neq f(x))^{\kappa}.$$

It is also satisfied when $\nu(\mathbb{C};\mathcal{P}_{XY}) = \nu^*(\mathcal{P}_{XY})$ and

$$\exists \mu'', \alpha \in (0,\infty) \text{ s.t. } \forall \varepsilon > 0,\ \mathcal{P}(x : |\eta(x;\mathcal{P}_{XY}) - 1/2| \leq \varepsilon) \leq \mu'' \cdot \varepsilon^{\alpha},$$

where $\kappa$ and $\mu$ are functions of $\alpha$ and $\mu''$ (Mammen and Tsybakov, 1999; Tsybakov, 2004); in
particular, $\kappa = (1 + \alpha)/\alpha$. Special cases of this condition have also been studied in depth; for
instance, bounded noise conditions, wherein $\nu(\mathbb{C};\mathcal{P}_{XY}) = \nu^*(\mathcal{P}_{XY})$ and $\forall x$, $|\eta(x;\mathcal{P}_{XY}) - 1/2| > c$
for some constant $c > 0$ (e.g., Giné and Koltchinskii, 2006; Massart and Nédélec, 2006), are a
special case of Condition 1 with $\kappa = 1$.

Condition 1 can be interpreted in a variety of ways, depending on the context. For instance,
in certain concept spaces with a geometric interpretation, it can often be realized as a kind of large
margin condition, under some condition relating the noisiness of a point's label to its distance from
the optimal decision surface. That is, if the magnitude of noise $(1/2 - |\eta(x;\mathcal{P}_{XY}) - 1/2|)$ for
a given point depends inversely on its distance from the optimal decision surface, so that points
closer to the decision surface have noisier labels, a small value of $\kappa$ in Condition 1 will occur if the
distribution $\mathcal{P}$ has low density near the optimal decision surface (assuming $\nu(\mathbb{C};\mathcal{P}_{XY}) = \nu^*(\mathcal{P}_{XY})$)
(e.g., Dekel, Gentile, and Sridharan, 2010). On the other hand, when there is high density near the
optimal decision surface, the value of $\kappa$ may be determined by how quickly $\eta(x;\mathcal{P}_{XY})$ changes as
$x$ approaches the decision boundary (Castro and Nowak, 2008). See the works of Mammen and
Tsybakov (1999); Tsybakov (2004); Koltchinskii (2006); Massart and Nédélec (2006); Castro and
Nowak (2008); Dekel, Gentile, and Sridharan (2010); Bartlett, Jordan, and McAuliffe (2006) for
further interpretations of Condition 1.

In the context of passive learning, one natural method to study is that of empirical risk minimization.
Recall that a passive learning algorithm $\mathcal{A}$ is called an empirical risk minimization algorithm
for $\mathbb{C}$ if it returns a classifier from $\mathbb{C}$ making the minimum number of mistakes on the labeled sample
it is given as input. It is known that for any VC class $\mathbb{C}$, for any $\mathcal{P}_{XY}$ satisfying Condition 1 for
finite $\mu$ and $\kappa$, every empirical risk minimization algorithm for $\mathbb{C}$ achieves a label complexity

$$\Lambda(\nu + \varepsilon,\mathcal{P}_{XY}) = O\left(\varepsilon^{\frac{1}{\kappa} - 2} \cdot \log\frac{1}{\varepsilon}\right). \qquad (5)$$

This follows from the works of Koltchinskii (2006) and Massart and Nédélec (2006). Furthermore,
for nontrivial concept spaces, one can show that $\inf_{\Lambda} \sup_{\mathcal{P}_{XY}} \Lambda(\nu + \varepsilon;\mathcal{P}_{XY}) = \Omega\left(\varepsilon^{\frac{1}{\kappa} - 2}\right)$, where
the supremum ranges over all $\mathcal{P}_{XY}$ satisfying Condition 1 for the given $\mu$ and $\kappa$ values, and the
infimum ranges over all label complexities achievable by passive learning algorithms (Castro and
Nowak, 2008; Hanneke, 2011); that is, the bound (5) cannot be significantly improved by any passive
algorithm, without allowing the label complexity to have a more refined dependence on $\mathcal{P}_{XY}$
than afforded by Condition 1.

In the context of active learning, a variety of results are presently known, which in some cases
show improvements over (5). Specifically, for any VC class $\mathbb{C}$ and any $\mathcal{P}_{XY}$ satisfying Condition 1,
a certain noise-robust disagreement-based active learning algorithm achieves label complexity

$$\Lambda(\nu + \varepsilon,\mathcal{P}_{XY}) = O\left(\theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right) \cdot \varepsilon^{\frac{2}{\kappa} - 2} \cdot \log^2\frac{1}{\varepsilon}\right). \qquad (6)$$

This general result was established by Hanneke (2011) (analyzing the algorithm of Dasgupta, Hsu,
and Monteleoni (2007)), generalizing earlier $\mathbb{C}$-specific results by Castro and Nowak (2008) and
Balcan, Broder, and Zhang (2007), and was later simplified and refined in some cases by Koltchinskii
(2010). Comparing this to (5), when $\theta_f < \infty$ this is an improvement over passive learning
by a factor of $\varepsilon^{\frac{1}{\kappa}} \cdot \log(1/\varepsilon)$. Note that this generalizes the label complexity bound of Corollary 11
above, since the realizable case entails Condition 1 with $\kappa = \mu/2 = 1$. It is also known
that this type of improvement is essentially the best we can hope for when we describe $\mathcal{P}_{XY}$
purely in terms of the parameters of Condition 1. Specifically, for any nontrivial concept space
$\mathbb{C}$, $\inf_{\Lambda} \sup_{\mathcal{P}_{XY}} \Lambda(\nu + \varepsilon,\mathcal{P}_{XY}) = \Omega\left(\max\left\{\varepsilon^{\frac{2}{\kappa} - 2}, \log\frac{1}{\varepsilon}\right\}\right)$, where the supremum ranges over all
$\mathcal{P}_{XY}$ satisfying Condition 1 for the given $\mu$ and $\kappa$ values, and the infimum ranges over all label
complexities achievable by active learning algorithms (Hanneke, 2011; Castro and Nowak, 2008).
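The size of the gap between the passive and active bounds can be made explicit with a short computation. The following is a worked sketch, assuming the passive bound (5) has the form $\varepsilon^{1/\kappa - 2}\log(1/\varepsilon)$ and the active bound (6) has the form $\theta_f(\varepsilon^{1/\kappa})\,\varepsilon^{2/\kappa - 2}\log^2(1/\varepsilon)$, with constant factors ignored throughout.

```latex
\[
\frac{\text{passive bound (5)}}{\text{active bound (6)}}
= \frac{\varepsilon^{\frac{1}{\kappa}-2}\,\log\frac{1}{\varepsilon}}
       {\theta_f\!\left(\varepsilon^{1/\kappa}\right)\,
        \varepsilon^{\frac{2}{\kappa}-2}\,\log^{2}\frac{1}{\varepsilon}}
= \frac{\varepsilon^{-1/\kappa}}
       {\theta_f\!\left(\varepsilon^{1/\kappa}\right)\,\log\frac{1}{\varepsilon}} .
\]
% Two illustrative cases with a bounded disagreement coefficient:
%   kappa = 1:  passive ~ eps^{-1} log(1/eps),   active ~ log^2(1/eps)
%   kappa = 2:  passive ~ eps^{-3/2} log(1/eps), active ~ eps^{-1} log^2(1/eps)
```

Under these assumed forms, the ratio diverges as $\varepsilon \to 0$ whenever $\theta_f(\varepsilon^{1/\kappa})$ grows more slowly than $\varepsilon^{-1/\kappa}/\log(1/\varepsilon)$, and the advantage shrinks as $\kappa$ grows.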



In the following subsection, we review the established techniques and results for disagreement-based
agnostic active learning; the algorithm presented there is slightly different from that originally
analyzed by Hanneke (2011), but the label complexity bounds of Hanneke (2011) hold for this new
algorithm as well. We follow this in Subsection 6.7 with a new agnostic active learning method
that goes beyond disagreement-based learning, again generalizing the notion of disagreement to the
notion of shatterability; this can be viewed as analogous to the generalization of Meta-Algorithm
2 represented by Meta-Algorithm 3, and as in that case the resulting label complexity bound replaces
$\theta_f(\cdot)$ with $\tilde{\theta}_f(\cdot)$.

For both passive and active learning, results under Condition 1 are also known for more general
scenarios than VC classes: namely, entropy conditions (Mammen and Tsybakov, 1999; Tsybakov,
2004; Koltchinskii, 2006, 2008; Massart and Nédélec, 2006; Castro and Nowak, 2008; Hanneke,
2011; Koltchinskii, 2010). For a nonparametric class known as boundary fragments, Castro and
Nowak (2008) find that active learning sometimes offers advantages over passive learning, under
a special case of Condition 1. Furthermore, Hanneke (2011) shows a general result on the label
complexity achievable by disagreement-based agnostic active learning, which sometimes exhibits
an improved dependence on the parameters of Condition 1 under conditions on the disagreement
coefficient and certain entropy conditions for $(\mathbb{C},\mathcal{P})$ (see also Koltchinskii, 2010). These results
will not play a role in the discussion below, as in the present work we restrict ourselves strictly to
VC classes, leaving more general results for future investigations.



6.5 Disagreement-Based Agnostic Active Learning 

Unlike the realizable case, here in the agnostic case we cannot eliminate a classifier from the version 
space after making merely a single mistake, since even the best classifier is potentially imperfect. 
Rather, we take a collection of samples with labels, and eliminate those classifiers making signifi- 
cantly more mistakes relative to some others in the version space. This is the basic idea underlying 
most of the known agnostic active learning algorithms, including those discussed in the present 
work. The precise meaning of "significantly more," sufficient to guarantee the version space always 
contains some good classifier, is typically determined by established bounds on the deviation of 
excess empirical error rates from excess true error rates, taken from the passive learning literature.



The following disagreement-based algorithm is slightly different from any in the existing literature,
but is similar in style to a method of Beygelzimer, Dasgupta, and Langford (2009); it also
bears resemblance to the algorithms of Koltchinskii (2010); Dasgupta, Hsu, and Monteleoni (2007);
Balcan, Beygelzimer, and Langford (2006a, 2009). It should be considered as representative of the
family of disagreement-based agnostic active learning algorithms, and all results below concerning
it have analogous results for variants of these other disagreement-based methods.

Algorithm 4

Input: label budget $n$, confidence parameter $\delta$
Output: classifier $\hat{h}$

0. $m \leftarrow 0$, $i \leftarrow 0$, $V_0 \leftarrow \mathbb{C}$, $\mathcal{L}_1 \leftarrow \emptyset$
1. While $t < n$ and $m < 2^n$
2.     $m \leftarrow m + 1$
3.     If $X_m \in \operatorname{DIS}(V_i)$
4.         Request the label $Y_m$ of $X_m$, and let $\mathcal{L}_{i+1} \leftarrow \mathcal{L}_{i+1} \cup \{(X_m, Y_m)\}$ and $t \leftarrow t + 1$
5.     Else let $\hat{y}$ be the label agreed upon by classifiers in $V_i$, and $\mathcal{L}_{i+1} \leftarrow \mathcal{L}_{i+1} \cup \{(X_m, \hat{y})\}$
6.     If $m = 2^{i+1}$
7.         $V_{i+1} \leftarrow \left\{h \in V_i : \operatorname{er}_{\mathcal{L}_{i+1}}(h) - \min_{h' \in V_i} \operatorname{er}_{\mathcal{L}_{i+1}}(h') \leq \hat{U}_{i+1}(V_i, \delta)\right\}$
8.         $i \leftarrow i + 1$, and then $\mathcal{L}_{i+1} \leftarrow \emptyset$
9. Return any $\hat{h} \in V_i$



The algorithm is specified in terms of an estimator, $\hat{U}_i$. The definition of $\hat{U}_i$ should typically be
based on generalization bounds known for passive learning. Inspired by the work of Koltchinskii
(2006) and applications thereof in active learning (Hanneke, 2011; Koltchinskii, 2010), we will take
a definition of $\hat{U}_i$ based on a data-dependent Rademacher complexity, as follows. Let $\xi_1, \xi_2, \ldots$
denote a sequence of independent Rademacher random variables (i.e., uniform in $\{-1,+1\}$), also
independent from all other random variables in the algorithm (i.e., $\mathcal{Z}$). Then for any set $\mathcal{H} \subseteq \mathbb{C}$,
define

$$\hat{R}_i(\mathcal{H}) = \sup_{h_1,h_2 \in \mathcal{H}} 2^{-i} \sum_{m=2^{i-1}+1}^{2^i} \xi_m \cdot (h_1(X_m) - h_2(X_m)),$$

$$\hat{D}_i(\mathcal{H}) = \sup_{h_1,h_2 \in \mathcal{H}} 2^{-i} \sum_{m=2^{i-1}+1}^{2^i} |h_1(X_m) - h_2(X_m)|,$$
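When the set of classifiers is finite, the two suprema above can be evaluated by direct enumeration over pairs. The sketch below is an illustration under assumptions: the function name `rademacher_and_diameter` is hypothetical, the block of points and the Rademacher signs are passed in explicitly, and the normalization is a caller-supplied `scale` (the factor $2^{-i}$ in the text) so that the sketch does not commit to a particular summation range.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Classifier = Callable[[float], int]  # maps a point to a label in {-1, +1}


def rademacher_and_diameter(
    hyps: Sequence[Classifier],
    xs: Sequence[float],
    xi: Sequence[int],
    scale: float,
) -> Tuple[float, float]:
    """Return (R_hat, D_hat) for a finite hypothesis set.

    R_hat = max over pairs (h1, h2) of scale * sum_m xi[m] * (h1(x_m) - h2(x_m))
    D_hat = max over pairs (h1, h2) of scale * sum_m |h1(x_m) - h2(x_m)|
    """
    preds = [[h(x) for x in xs] for h in hyps]  # cache predictions per hypothesis
    r_hat, d_hat = 0.0, 0.0  # the pair h1 = h2 contributes 0 to both maxima
    for p1, p2 in product(preds, repeat=2):
        r = scale * sum(s * (a - b) for s, a, b in zip(xi, p1, p2))
        d = scale * sum(abs(a - b) for a, b in zip(p1, p2))
        r_hat, d_hat = max(r_hat, r), max(d_hat, d)
    return r_hat, d_hat
```

Enumerating ordered pairs matters for the first quantity, since swapping the two classifiers flips its sign; the second quantity is symmetric. For infinite classes the suprema would instead have to be computed as optimization problems over the class.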

Algorithm 4 operates by repeatedly doubling the sample size $|\mathcal{L}_{i+1}|$, while only requesting the
labels of the points in the region of disagreement of the version space. Each time it doubles the size
of the sample $\mathcal{L}_{i+1}$, it updates the version space by eliminating any classifiers that make significantly
more mistakes on $\mathcal{L}_{i+1}$ relative to others in the version space. Since the labels of the examples we
infer in Step 5 are agreed upon by all elements of the version space, the difference of empirical error
rates in Step 7 is identical to the difference of empirical error rates under the true labels. This allows
us to use established results on deviations of excess empirical error rates from excess true error rates
to judge suboptimality of some of the classifiers in the version space in Step 7, thus reducing the
version space.
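The doubling scheme described above can be sketched in code for a finite class of threshold classifiers. This is a simplified illustration and not the algorithm as analyzed: the function name, the finite grid of thresholds, the convention that a threshold classifier predicts $+1$ exactly when $x \geq z$, and in particular the caller-supplied `slack` function (a hypothetical stand-in for the estimator used in the elimination step) are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def disagreement_active_learner(
    stream: Sequence[Tuple[float, int]],
    thresholds: Sequence[float],
    budget: int,
    slack: Callable[[int], float],
) -> Tuple[List[float], int]:
    """Simplified sketch of the doubling scheme for threshold classifiers.

    stream     -- pairs (x, y); y is only read when the label is "requested"
    thresholds -- finite initial version space (one classifier per threshold)
    budget     -- maximum number of label requests
    slack      -- hypothetical stand-in for the elimination threshold, by round
    Returns the final version space and the number of labels requested.
    """
    predict = lambda z, x: 1 if x >= z else -1
    version_space = list(thresholds)
    batch: List[Tuple[float, int]] = []  # labeled sample collected since last update
    used, i = 0, 0
    for m, (x, y) in enumerate(stream, start=1):
        if used >= budget:
            break
        votes = {predict(z, x) for z in version_space}
        if len(votes) > 1:       # x is in the region of disagreement: request label
            batch.append((x, y))
            used += 1
        else:                    # all remaining classifiers agree: infer the label
            batch.append((x, votes.pop()))
        if m == 2 ** (i + 1):    # the sample size has doubled: update version space
            errs = {z: sum(predict(z, a) != b for a, b in batch) / len(batch)
                    for z in version_space}
            best = min(errs.values())
            version_space = [z for z in version_space
                             if errs[z] - best <= slack(i + 1)]
            i += 1
            batch = []
    return version_space, used
```

With noise-free labels and `slack` identically zero, the classifier that generated the labels is never eliminated, and points on which the surviving thresholds agree cost no label requests. With noisy labels, `slack` must be large enough that the best classifier survives each update, which is the role played by the deviation bounds mentioned in the text.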

As with Meta-Algorithm 2, for computational feasibility, the sets $V_i$ and $\operatorname{DIS}(V_i)$ in Algorithm
4 can be represented implicitly by a set of constraints imposed by previous rounds of the loop. Also,
the update to $\mathcal{L}_{i+1}$ in Step 5 is included only to make Step 7 somewhat simpler or more intuitive;
it can be removed without altering the behavior of the algorithm, as long as we compensate by
multiplying $\operatorname{er}_{\mathcal{L}_{i+1}}$ by an appropriate renormalization constant in Step 7: namely, $2^{-i}|\mathcal{L}_{i+1}|$.
We have the following result about the label complexity of Algorithm 4; it is representative of
the type of theorem one can prove about disagreement-based active learning under Condition 1.



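Lemma 24 Let $\mathbb{C}$ be a VC class and suppose the joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ satisfies
Condition 1 for finite parameters $\mu$ and $\kappa$. There is a $(\mathbb{C},\mathcal{P}_{XY})$-dependent constant $c \in (0,\infty)$
such that, for any $\varepsilon, \delta \in (0, e^{-3})$, and any integer

$$n \geq c \cdot \theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right) \cdot \varepsilon^{\frac{2}{\kappa} - 2} \cdot \log^2\frac{1}{\varepsilon\delta},$$

if $\hat{h}_n$ is the output of Algorithm 4 when run with label budget $n$ and confidence parameter $\delta$, then
on an event of probability at least $1 - \delta$,

$$\operatorname{er}(\hat{h}_n) \leq \nu + \varepsilon. \qquad \diamond$$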



The proof of this result is essentially similar to the proof by Hanneke (2011), combined with
some simplifying ideas from Koltchinskii (2010). It is also implicit in the proof of Lemma 26 below
(by replacing "$\tilde{d}_f$" with "1" in the proof). The details are omitted. This result leads immediately to
the following implication concerning the label complexity.

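Theorem 25 Let $\mathbb{C}$ be a VC class and suppose the joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ satisfies
Condition 1 for finite parameters $\mu, \kappa \in (1,\infty)$. With an appropriate $(n,\kappa)$-dependent setting of $\delta$,
Algorithm 4 achieves a label complexity $\Lambda_a$ with

$$\Lambda_a(\nu + \varepsilon,\mathcal{P}_{XY}) = O\left(\theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right) \cdot \varepsilon^{\frac{2}{\kappa} - 2} \cdot \log^2\frac{1}{\varepsilon}\right). \qquad \diamond$$

Proof Taking $\delta = n^{-\frac{\kappa}{2\kappa - 2}}$, the result follows by simple algebra. ■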
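The algebra behind this proof can be sketched as follows. This is one reading of the argument, not the author's written proof. It assumes the budget condition of Lemma 24 has the form $n \geq c \cdot \theta_f(\varepsilon^{1/\kappa}) \cdot \varepsilon^{2/\kappa - 2} \cdot \log^2\frac{1}{\varepsilon\delta}$, that the confidence parameter is chosen as $\delta = n^{-\kappa/(2\kappa-2)}$, and that the constants are such that the budget condition forces $n \geq \varepsilon^{2/\kappa - 2}$.

```latex
\begin{align*}
n \;\ge\; \varepsilon^{\frac{2}{\kappa}-2} \;=\; \varepsilon^{-\frac{2\kappa-2}{\kappa}}
&\quad\Longrightarrow\quad
\delta \;=\; n^{-\frac{\kappa}{2\kappa-2}} \;\le\; \varepsilon, \\[4pt]
\mathbb{E}\!\left[\operatorname{er}(\hat h_n)\right]
&\;\le\; (\nu+\varepsilon) + \delta
\;\le\; \nu + 2\varepsilon .
\end{align*}
% The first bound on the expectation uses er <= nu + eps on an event of
% probability at least 1 - delta, and er <= 1 on its complement.
```

Since $\delta$ is then polynomial in $n$, and the smallest admissible $n$ is polynomial in $1/\varepsilon$ whenever $\theta_f(\varepsilon^{1/\kappa})$ is, the factor $\log^2\frac{1}{\varepsilon\delta}$ is $O(\log^2\frac{1}{\varepsilon})$ up to $\kappa$-dependent constants; replacing $\varepsilon$ by $\varepsilon/2$ then gives a bound of the stated form. The restriction $\kappa > 1$ is what keeps the exponent $\kappa/(2\kappa-2)$ finite.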



We should note that it is possible to design a kind of wrapper to adaptively determine an appropriate
$\delta$ value, so that the algorithm achieves the label complexity guarantee of Theorem 25 without
requiring any explicit dependence on the noise parameter $\kappa$. Specifically, one can use an idea similar
to the model selection procedure of Hanneke (2011) for this purpose. However, as our focus in
this work is on moving beyond disagreement-based active learning, we do not include the details of
such a procedure here.

Note that Theorem 25 represents an improvement over the known results for passive learning
(namely, (5)) whenever $\theta_f(\varepsilon)$ is small, and in particular this gap can be large when $\theta_f < \infty$. The
results of Lemma 24 and Theorem 25 represent the state-of-the-art (up to logarithmic factors) in our
understanding of the label complexity of agnostic active learning for VC classes. Thus, any significant
improvement over these would advance our understanding of the fundamental capabilities of
active learning in the presence of label noise. Next, we provide such an improvement.

6.6 A New Type of Agnostic Active Learning Algorithm Based on Shatterable Sets 

Algorithm 4 and Theorem 25 represent natural extensions of Meta-Algorithm 2 and Theorem 10 to
the agnostic setting. As such, they not only benefit from the advantages of those methods (small
$\theta_f(\varepsilon)$ implies improved label complexity), but also suffer the same disadvantages ($\mathcal{P}(\partial f) > 0$
implies no strong improvements over passive). It is therefore natural to investigate whether the
improvements offered by Meta-Algorithm 3 and the corresponding Theorem 16 can be extended to the
agnostic setting in a similar way. In particular, as was possible for Theorem 16 with respect to
Theorem 10, we might wonder whether it is possible to replace $\theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right)$ in Theorem 25 with $\tilde{\theta}_f\left(\varepsilon^{\frac{1}{\kappa}}\right)$
by a modification of Algorithm 4 analogous to the modification of Meta-Algorithm 2 embodied in
Meta-Algorithm 3. As we have seen, $\tilde{\theta}_f\left(\varepsilon^{\frac{1}{\kappa}}\right)$ is often significantly smaller in its asymptotic
dependence on $\varepsilon$, compared to $\theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right)$, in many cases even bounded by a finite constant when $\theta_f\left(\varepsilon^{\frac{1}{\kappa}}\right)$
is not. This would therefore represent a significant improvement over the known results for active
learning under Condition 1. Toward this end, consider the following algorithm.



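Algorithm 5

Input: label budget $n$, confidence parameter $\delta$
Output: classifier $\hat{h}$

0. $m \leftarrow 0$, $i_0 \leftarrow 0$, $V_0 \leftarrow \mathbb{C}$
1. For $k = 1, 2, \ldots, d + 1$
2.     $t \leftarrow 0$, $i_k \leftarrow i_{k-1}$, $m \leftarrow 2^{i_k}$, $V_{i_k+1} \leftarrow V_{i_k}$, $\mathcal{L}_{i_k+1} \leftarrow \emptyset$
3.     While $t < \lfloor 2^{-k} n \rfloor$ and $m < k \cdot 2^n$
4.         $m \leftarrow m + 1$
5.         If $\hat{P}_{4m}\left(S \in \mathcal{X}^{k-1} : V_{i_k+1} \text{ shatters } S \cup \{X_m\} \mid V_{i_k+1} \text{ shatters } S\right) \geq 1/2$
6.             Request the label $Y_m$ of $X_m$, and let $\mathcal{L}_{i_k+1} \leftarrow \mathcal{L}_{i_k+1} \cup \{(X_m, Y_m)\}$ and $t \leftarrow t + 1$
7.         Else $\hat{y} \leftarrow \operatorname{argmax}_{y \in \{-1,+1\}} \hat{P}_{4m}\left(S \in \mathcal{X}^{k-1} : V_{i_k+1}[(X_m,-y)] \text{ does not shatter } S \mid V_{i_k+1} \text{ shatters } S\right)$
8.             $\mathcal{L}_{i_k+1} \leftarrow \mathcal{L}_{i_k+1} \cup \{(X_m, \hat{y})\}$ and $V_{i_k+1} \leftarrow V_{i_k+1}[(X_m, \hat{y})]$
9.         If $m = 2^{i_k+1}$
10.            $V_{i_k+1} \leftarrow \left\{h \in V_{i_k+1} : \operatorname{er}_{\mathcal{L}_{i_k+1}}(h) - \min_{h' \in V_{i_k+1}} \operatorname{er}_{\mathcal{L}_{i_k+1}}(h') \leq \hat{U}_{i_k+1}(V_{i_k}, \delta)\right\}$
11.            $i_k \leftarrow i_k + 1$, then $V_{i_k+1} \leftarrow V_{i_k}$, and $\mathcal{L}_{i_k+1} \leftarrow \emptyset$
12. Return any $\hat{h} \in V_{i_{d+1}+1}$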



For the argmax in Step 7, we break ties in favor of a $y$ value with $V_{i_k+1}[(X_m,y)] \neq \emptyset$ to
maintain the invariant that $V_{i_k+1} \neq \emptyset$ (see the proof of Lemma 59); when both $y$ values satisfy this,
we may break ties arbitrarily. The procedure is specified in terms of several estimators. The $\hat{P}_{4m}$
estimators, as usual, are defined in Appendix B.1. For $\hat{U}_i$, we again use the definition (7) above,
based on a data-dependent Rademacher complexity.
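The two set operations that Algorithm 5 relies on, testing whether the version space shatters a set of points and restricting the version space to the classifiers consistent with a labeled point, are easy to state in code for a finite version space. The sketch below is only an illustration: the names `shatters` and `restrict` are hypothetical, and it does not implement the probability estimators over random sets $S$ that the algorithm actually uses.

```python
from itertools import product
from typing import Callable, List, Sequence

Classifier = Callable[[float], int]  # maps a point to a label in {-1, +1}


def shatters(version_space: Sequence[Classifier], points: Sequence[float]) -> bool:
    """True if every +/-1 labeling of `points` is realized by some classifier."""
    realized = {tuple(h(x) for x in points) for h in version_space}
    return all(lab in realized for lab in product((-1, 1), repeat=len(points)))


def restrict(version_space: Sequence[Classifier], x: float, y: int) -> List[Classifier]:
    """The classifiers in the version space that label x as y."""
    return [h for h in version_space if h(x) == y]
```

For threshold classifiers, for instance, any single point lying between two of the thresholds is shattered, while no pair of points is. A nonempty version space trivially shatters the empty set, so for $k = 1$ the shattering test on $S \cup \{X_m\}$ with $S$ empty reduces to asking whether $X_m$ lies in the region of disagreement.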

Algorithm 5 is largely based on the same principles as Algorithm 4, combined with Meta-Algorithm
3. As in Algorithm 4, the algorithm proceeds by repeatedly doubling the size of a labeled
sample $\mathcal{L}_{i+1}$, while only requesting a subset of the labels in $\mathcal{L}_{i+1}$, inferring the others. As before,
it updates the version space every time it doubles the size of the sample $\mathcal{L}_{i+1}$, and the update
eliminates classifiers from the version space that make significantly more mistakes on $\mathcal{L}_{i+1}$ compared to
others in the version space. In Algorithm 4, this is guaranteed to be effective, since the classifiers in
the version space agree on all of the inferred labels, so that the differences of empirical error rates
remain equal to the true differences of empirical error rates (i.e., under the true $Y_m$ labels for all
elements of $\mathcal{L}_{i+1}$); thus, the established results from the passive learning literature bounding the
deviations of excess empirical error rates from excess true error rates can be applied, showing that
this does not eliminate the best classifiers. In Algorithm 5, the situation is somewhat more subtle,
but the principle remains the same. In this case, we enforce that the classifiers in the version space
agree on the inferred labels in $\mathcal{L}_{i+1}$ by explicitly removing the disagreeing classifiers in Step 8.
Thus, as long as Step 8 does not eliminate all of the good classifiers, then neither will Step 10. To
argue that Step 8 does not eliminate all good classifiers, we appeal to the same reasoning as for
Meta-Algorithm 1 and Meta-Algorithm 3. That is, for $k \leq \tilde{d}_f$ and sufficiently large $n$, as long as
there exist good classifiers in the version space, the labels $\hat{y}$ inferred in Step 7 will agree with some
good classifiers, and thus Step 8 will not eliminate all good classifiers. However, for $k > \tilde{d}_f$, the
labels $\hat{y}$ in Step 7 have no such guarantees, so that we are only guaranteed that some classifier in
the version space is not eliminated. Thus, determining guarantees on the error rate of this algorithm
hinges on bounding the worst excess error rate among all classifiers in the version space at the
conclusion of the $k = \tilde{d}_f$ round. This is essentially determined by the size of $\mathcal{L}_{i_k}$ at the conclusion of
that round, which itself is largely determined by how frequently the algorithm requests labels during
this $k = \tilde{d}_f$ round. Thus, once again the analysis rests on bounding the rate at which the frequency
of label requests shrinks in the $k = \tilde{d}_f$ round, which determines the rate of growth of $|\mathcal{L}_{i_k}|$, and thus
the final guarantee on the excess error rate.

As before, for computational feasibility, we can maintain the sets $V_i$ implicitly as a set of
constraints imposed by the previous updates, so that we may perform the various calculations required
for the estimators $\hat{P}$ as constrained optimizations. Also, the update to $\mathcal{L}_{i_k+1}$ in Step 8 is merely
included to make the algorithm statement and the proofs somewhat more elegant; it can be omitted,
as long as we compensate with an appropriate renormalization of the $\operatorname{er}_{\mathcal{L}_{i_k+1}}$ values in Step
10 (i.e., multiplying by $2^{-i_k}|\mathcal{L}_{i_k+1}|$). Additionally, the same potential improvements we proposed
in Section 5.5 for Meta-Algorithm 3 can be made to Algorithm 5 as well, again with only minor
modifications to the proofs.

We should note that this is certainly not the only reasonable way to extend Meta-Algorithm 3 to
the agnostic setting. For instance, another natural extension of Meta-Algorithm 1 to the agnostic
setting, based on a completely different idea, appears in the author's doctoral dissertation (Hanneke,
2009b); that method can be improved in a natural way to take advantage of the sequential aspect of
active learning, yielding an agnostic extension of Meta-Algorithm 3 differing from Algorithm 5 in
several interesting ways.

In the next subsection, we will see that the label complexities achieved by Algorithm 5 are often 
significantly better than the known results for passive learning. In fact, they are often significantly 
better than the presently-known results for any active learning algorithms in the published literature. 

6.7 Improved Label Complexity Bounds for Active Learning with Noise 

Under Condition 1, we can extend Lemma 24 and Theorem 25 in an analogous way to how
Theorem 16 extends Theorem 10. Specifically, we have the following result, the proof of which is
included in Appendix E.2.

Lemma 26 Let $\mathbb{C}$ be a VC class and suppose the joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ satisfies
Condition 1 for finite parameters $\mu$ and $\kappa$. There is a $(\mathbb{C},\mathcal{P}_{XY})$-dependent constant $c \in (0,\infty)$
such that, for any $\varepsilon, \delta \in (0, e^{-3})$, and any integer

$$n \geq c \cdot \tilde{\theta}_f\left(\varepsilon^{\frac{1}{\kappa}}\right) \cdot \varepsilon^{\frac{2}{\kappa} - 2} \cdot \log^2\frac{1}{\varepsilon\delta},$$

if $\hat{h}_n$ is the output of Algorithm 5 when run with label budget $n$ and confidence parameter $\delta$, then
on an event of probability at least $1 - \delta$,

$$\operatorname{er}(\hat{h}_n) \leq \nu + \varepsilon. \qquad \diamond$$

This has the following implication for the label complexity of Algorithm 5. 

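Theorem 27 Let $\mathbb{C}$ be a VC class and suppose the joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ satisfies
Condition 1 for finite parameters $\mu, \kappa \in (1,\infty)$. With an appropriate $(n,\kappa)$-dependent setting of $\delta$,
Algorithm 5 achieves a label complexity $\Lambda_a$ with

$$\Lambda_a(\nu + \varepsilon,\mathcal{P}_{XY}) = O\left(\tilde{\theta}_f\left(\varepsilon^{\frac{1}{\kappa}}\right) \cdot \varepsilon^{\frac{2}{\kappa} - 2} \cdot \log^2\frac{1}{\varepsilon}\right). \qquad \diamond$$

Proof Taking $\delta = n^{-\frac{\kappa}{2\kappa - 2}}$, the result follows by simple algebra. ■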




Theorem 27 represents an interesting generalization beyond the realizable case, and beyond the
disagreement coefficient analysis. Note that if $\tilde{\theta}_f(\varepsilon) = o\left(\varepsilon^{-1}\log^{-2}(1/\varepsilon)\right)$, Theorem 27 represents
an improvement over the known results for passive learning (Massart and Nedelec, 2006). As we
always have $\tilde{\theta}_f(\varepsilon) = o(1/\varepsilon)$, we should typically expect such improvements for all but the most
extreme learning problems. Recall that $\theta_f(\varepsilon)$ is often not $o(1/\varepsilon)$, so that Theorem 27 is often a
much stronger statement than Theorem 25. In particular, this is a significant improvement over the
known results for passive learning whenever $\tilde{\theta}_f < \infty$, and an equally significant improvement over
Theorem 25 whenever $\tilde{\theta}_f < \infty$ but $\theta_f(\varepsilon) = \Omega(1/\varepsilon)$ (see above for examples of this). However,
note that unlike Meta-Algorithm 3, Algorithm 5 is not an activizer. Indeed, it is not clear (to the
author) how to modify the algorithm to make it a universal activizer (even for the realizable case),
while maintaining the guarantees of Theorem 27.
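
To get a feel for the size of the gap discussed above, the following minimal sketch compares the two rates numerically with all constants dropped. It assumes the passive rate has the form $\varepsilon^{1/\kappa-2}$ and the active rate has the form $\tilde{\theta}_f(\varepsilon^{1/\kappa}) \cdot \varepsilon^{2/\kappa-2} \cdot \log^2(1/\varepsilon)$, as we read them from the discussion of Theorem 27; the function names and the bounded choice of $\tilde{\theta}_f$ are hypothetical and serve only as an illustration.

```python
import math

def passive_rate(eps: float, kappa: float) -> float:
    """Assumed passive rate eps^(1/kappa - 2), constants dropped."""
    return eps ** (1.0 / kappa - 2.0)

def active_rate(eps: float, kappa: float, theta) -> float:
    """Assumed active rate theta(eps^(1/kappa)) * eps^(2/kappa - 2) * log^2(1/eps)."""
    return (theta(eps ** (1.0 / kappa))
            * eps ** (2.0 / kappa - 2.0)
            * math.log(1.0 / eps) ** 2)

def ratio(eps: float, kappa: float, theta) -> float:
    """Active rate divided by passive rate; it tends to 0 when active learning wins."""
    return active_rate(eps, kappa, theta) / passive_rate(eps, kappa)

def bounded_theta(r: float) -> float:
    """Hypothetical bounded coefficient (illustration only, not taken from the paper)."""
    return 2.0

if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-8):
        print(eps, ratio(eps, 2.0, bounded_theta))
```

With a bounded coefficient the ratio behaves like $\varepsilon^{1/\kappa}\log^2(1/\varepsilon)$, so it shrinks as $\varepsilon \to 0$, though only after the logarithmic factor has been overcome.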

As with Theorem 16 and Corollary 17, Algorithm 5 and Theorem 27 can potentially be improved
in a variety of ways, as outlined in Section 5.5. In particular, Theorem 27 can be made slightly
sharper in some cases by replacing $\tilde{\theta}_f\left(\varepsilon^{1/\kappa}\right)$ with the sometimes-smaller (though more complicated)
quantity (01) (with $r_0 = \varepsilon^{1/\kappa}$).



6.8 Beyond Condition 1

While Theorem 27 represents an improvement over the known results for agnostic active learning,
Condition 1 is not fully general, and disallows many important and interesting scenarios. In
particular, one key property of Condition 1, heavily exploited in the label complexity proofs for
both passive learning and disagreement-based active learning, is that it implies $\mathrm{diam}(\mathbb{C}(\varepsilon)) \to 0$
as $\varepsilon \to 0$. In scenarios where this shrinking diameter condition is not satisfied, the existing
label complexity proofs for passive learning break down, and furthermore, the disagreement-based
algorithms themselves cease to give significant improvements over passive learning, for essentially
the same reasons leading to the "only if" part of Theorem 5 (i.e., the sampling region never focuses
beyond some nonzero-probability region). Even more alarming (at first glance) is the fact
that this same problem can sometimes be observed for the $k = \tilde{d}_f$ round of Algorithm 5; that is,

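$\mathcal{P}\left(x : \hat{P}\left(S \in \mathcal{X}^{k-1} : V_{i_k+1} \text{ shatters } S \cup \{x\} \,\middle|\, V_{i_k+1} \text{ shatters } S\right) \geq 1/2\right)$ is no longer guaranteed
to approach 0 as the budget $n$ increases (as it does when $\mathrm{diam}(\mathbb{C}(\varepsilon)) \to 0$).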

Thus, if we wish to approach an understanding of improvements achievable by active learning in
general, we must come to terms with scenarios where $\mathrm{diam}(\mathbb{C}(\varepsilon))$ does not shrink to zero. Toward
this goal, it will be helpful to partition the distributions into two distinct categories, which we will
refer to as the benign noise case and the misspecified model case. The $\mathcal{P}_{XY}$ in the benign noise
case are characterized by the property that $\nu(\mathbb{C};\mathcal{P}_{XY}) = \nu^*(\mathcal{P}_{XY})$; this is in some ways similar
to the realizable case, in that $\mathbb{C}$ can approximate an optimal classifier, except that the labels are
stochastic. In the benign noise case, the only reason $\mathrm{diam}(\mathbb{C}(\varepsilon))$ would not shrink to zero is if there
is a nonzero probability set of points $x$ with $\eta(x) = 1/2$; that is, there are at least two classifiers
achieving the Bayes error rate, and they are at nonzero distance from each other, which must mean
they disagree on some points that have equal probability of either label occurring.

Interestingly, it seems that in the benign noise case, $\mathrm{diam}(\mathbb{C}(\varepsilon)) \nrightarrow 0$ might not be a problem
for algorithms based on shatterable sets, such as Algorithm 5. In particular, Algorithm 5 appears to
continue exhibiting reasonable behavior in such scenarios. That is, even if there is a nonshrinking
probability that the query condition in Step 5 is satisfied for $k = \tilde{d}_f$, on any given sequence $\mathcal{Z}$ there
must be some smallest value of $k$ for which this probability does shrink as $n \to \infty$. For this value
of $k$, we should expect to observe good behavior from the algorithm, in that (for sufficiently large
$n$) the inferred labels in Step 7 will tend to agree with some optimal classifier. Thus, the algorithm
addresses the problem of multiple optimal classifiers by effectively selecting one of the optimal
classifiers.

To illustrate this phenomenon, consider learning with respect to the space of threshold classifiers
(Example 1) with $\mathcal{P}$ uniform in $[0,1]$, and let $(X,Y) \sim \mathcal{P}_{XY}$ satisfy $\mathbb{P}(Y=+1|X) = 0$ for
$X < 1/3$, $\mathbb{P}(Y=+1|X) = 1/2$ for $1/3 \leq X < 2/3$, and $\mathbb{P}(Y=+1|X) = 1$ for $2/3 \leq X$. As
we know from above, $\tilde{d}_f = 1$ here. However, in this scenario we have $\mathrm{DIS}(\mathbb{C}(\varepsilon)) \to [1/3, 2/3]$ as
$\varepsilon \to 0$. Thus, Algorithm 4 never focuses its queries beyond a constant fraction of $\mathcal{X}$, and therefore
cannot improve over certain passive learning algorithms in terms of the asymptotic dependence of
its label complexity on $\varepsilon$ (assuming a worst-case choice of $h$ in Step 9). However, for $k = 2$
in Algorithm 5, every $X_m$ will be assigned a label $\hat{y}$ in Step 7 (since no 2 points are shattered);
furthermore, for sufficiently large $n$ we have (with high probability) $\mathrm{DIS}(V_{i_2})$ not too much larger
than $[1/3, 2/3]$, so that most points in $\mathrm{DIS}(V_{i_2})$ can be labeled either $+1$ or $-1$ by some optimal
classifier. For us, this has two implications. First, the $S \in [1/3, 2/3]^{1}$ will (with high probability)
dominate the votes for $\hat{y}$ in Step 7, so that the $\hat{y}$ inferred for any $X_m \notin [1/3, 2/3]$ will agree with
all of the optimal classifiers. Second, the inferred labels $\hat{y}$ for $X_m \in [1/3, 2/3]$ will definitely agree
with some optimal classifier. Since we also impose the $h(X_m) = \hat{y}$ constraint for $V_{i_2+1}$ in Step 8,
the inferred $\hat{y}$ labels must all be consistent with the same optimal classifier, so that $V_{i_2+1}$ will quickly
converge to within a small neighborhood around that classifier, without any further label requests.
Note, however, that the particular optimal classifier the algorithm converges to will be a random
variable, determined by the particular sequence of data points processed by the algorithm; thus, it
cannot be determined a priori, which significantly complicates any general attempt to analyze the
label complexity achieved by the algorithm for arbitrary $\mathbb{C}$ and $\mathcal{P}_{XY}$ satisfying the benign noise
condition. In particular, for some $\mathbb{C}$ and $\mathcal{P}_{XY}$, even this minimal $k$ for which convergence occurs
may be a nondeterministic random variable. At this time, it is not entirely clear how general this
phenomenon is (i.e., Algorithm 5 providing improvements over certain passive algorithms even for
benign noise distributions with $\mathrm{diam}(\mathbb{C}(\varepsilon)) \nrightarrow 0$), nor how to characterize the label complexity
achieved by Algorithm 5 in general benign noise settings where $\mathrm{diam}(\mathbb{C}(\varepsilon)) \nrightarrow 0$.
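
The threshold example above can be checked with exact arithmetic. The sketch below is an illustration only: it assumes the threshold classifier is $h_t(x) = +1$ iff $x \geq t$ and that $\mathbb{C}(\varepsilon)$ denotes the thresholds with error rate at most $\nu + \varepsilon$; the helper names (`error_rate`, `dis_region`, `dis_mass`) are hypothetical. It confirms that every threshold in $[1/3, 2/3]$ attains the same error rate $1/6$, and that the region of disagreement among $\varepsilon$-good thresholds has probability $1/3 + 2\varepsilon$, which does not shrink to zero.

```python
from fractions import Fraction

THIRD, TWO_THIRDS = Fraction(1, 3), Fraction(2, 3)
# (left end, right end, P(Y = +1 | X in piece)) for the three pieces of [0, 1]
PIECES = [
    (Fraction(0), THIRD, Fraction(0)),
    (THIRD, TWO_THIRDS, Fraction(1, 2)),
    (TWO_THIRDS, Fraction(1), Fraction(1)),
]

def overlap(a, b, lo, hi):
    """Length of the interval [a, b) intersected with [lo, hi)."""
    return max(Fraction(0), min(b, hi) - max(a, lo))

def error_rate(t):
    """Exact error rate of h_t(x) = +1 iff x >= t, with X uniform on [0, 1]."""
    t = Fraction(t)
    err = Fraction(0)
    for a, b, p_pos in PIECES:
        err += overlap(a, b, Fraction(0), t) * p_pos        # predicted -1, label is +1
        err += overlap(a, b, t, Fraction(1)) * (1 - p_pos)  # predicted +1, label is -1
    return err

def dis_region(eps):
    """Endpoints of the region where thresholds with error <= 1/6 + eps can disagree."""
    eps = Fraction(eps)
    return max(Fraction(0), THIRD - eps), min(Fraction(1), TWO_THIRDS + eps)

def dis_mass(eps):
    """Probability (under the uniform distribution) of the region above."""
    lo, hi = dis_region(eps)
    return hi - lo

if __name__ == "__main__":
    print(error_rate(Fraction(1, 2)), dis_mass(Fraction(1, 100)))
```

Using `Fraction` keeps the comparison between thresholds exact, so ties among the optimal classifiers are not blurred by floating-point error.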

However, as mentioned earlier, there are other natural ways to generalize Meta-Algorithm 3 to
handle noise, some of which have more predictable behavior in the general benign noise setting. In
particular, the original thesis work of Hanneke (2009b) explores a technique for active learning with
benign noise, which unlike Algorithm 5, only uses the requested labels, not the inferred labels, and
as a consequence never eliminates any optimal classifier from $V$. Because of this fact, the sampling
region for each $k$ converges to a predictable limiting region, so that we have an accurate a priori
characterization of the algorithm's behavior. However, it is not immediately clear (to the author)
whether this alternative technique might lead to a method achieving results similar to Theorem 27.
In contrast to the benign noise case, in the misspecified model case we have $\nu(\mathbb{C};\mathcal{P}_{XY}) >
\nu^*(\mathcal{P}_{XY})$. In this case, if the diameter does not shrink, it is because of the existence of two classifiers
$h_1, h_2 \in \mathrm{cl}(\mathbb{C})$ achieving error rate $\nu(\mathbb{C};\mathcal{P}_{XY})$, with $\mathcal{P}(x : h_1(x) \neq h_2(x)) > 0$. However, unlike
above, since they do not achieve the Bayes error rate, it is possible that a significant fraction of the
set of points they disagree on may have $\eta(x) \neq 1/2$. Intuitively, this makes the active learning
problem more difficult, as there is a worry that a method such as Algorithm 5 might infer the label
$h_2(x)$ for some point $x$ when in fact $h_1(x)$ is better for that particular $x$, and vice versa for the
points $x$ where $h_2(x)$ would be better, thus getting the worst of both and potentially doubling the
error rate in the process. However, it turns out that, for the purpose of exploring Conjecture 23,
we can circumvent all of these issues by noting that there is a trivial solution to the misspecified
model case. Specifically, since in our present context we are only interested in the label complexity
for achieving error rate better than $\nu + \varepsilon$, we can simply turn to any algorithm that asymptotically
achieves an error rate strictly better than $\nu$ (e.g., Devroye et al., 1996), in which case the algorithm
should require only a finite constant number of labels to achieve an expected error rate better than
$\nu$. To make the algorithm effective for the general case, we simply split our budget in three: one
part for an active learning algorithm, such as Algorithm 5, for the benign noise case, one part for the
method above handling the misspecified model case, and one part to select among their outputs. The
full details of such a procedure are specified in Appendix E.3, along with a proof of its performance
guarantees, which are summarized as follows.

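Theorem 28 Fix any concept space $\mathbb{C}$. Suppose there exists an active learning algorithm $\mathcal{A}_a$
achieving a label complexity $\Lambda_a$. Then there exists an active learning algorithm $\mathcal{A}'_a$ achieving a
label complexity $\Lambda'_a$ such that, for any distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$, there exists a function
$\lambda(\varepsilon) \in \mathrm{Polylog}(1/\varepsilon)$ such that

\[ \Lambda'_a(\nu+\varepsilon, \mathcal{P}_{XY}) \leq \begin{cases} \max\left\{2\Lambda_a(\nu+\varepsilon/2, \mathcal{P}_{XY}),\ \lambda(\varepsilon)\right\}, & \text{in the benign noise case} \\ \lambda(\varepsilon), & \text{in the misspecified model case.} \end{cases} \]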



The main point of Theorem 28 is that, for our purposes, we can safely ignore the misspecified
model case (as its solution is a trivial extension), and focus entirely on the performance of
algorithms for the benign noise case. In particular, for any label complexity $\Lambda_p$, every $\mathcal{P}_{XY} \in
\mathrm{Nontrivial}(\Lambda_p; \mathbb{C})$ in the misspecified model case has $\Lambda'_a(\nu+\varepsilon, \mathcal{P}_{XY}) = o(\Lambda_p(\nu+\varepsilon, \mathcal{P}_{XY}))$,
for $\Lambda'_a$ as in Theorem 28. Thus, if there exists an active meta-algorithm achieving the strong
improvement guarantees of an activizer for some passive learning algorithm $\mathcal{A}_p$ (Definition 21) for all
distributions $\mathcal{P}_{XY}$ in the benign noise case, then there exists an activizer for $\mathcal{A}_p$ with respect to $\mathbb{C}$
in the agnostic case.
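
The budget-splitting idea behind Theorem 28 can be sketched in a few lines. This is only a schematic illustration under our own assumptions, not the procedure of Appendix E.3: the names `split_budget`, `select_by_disagreement`, and `combined_learner` are hypothetical, the two learners are passed in as opaque callables, and the selection step is a simple stand-in that compares the two outputs only on points where they disagree.

```python
from typing import Callable, Sequence, Tuple

Classifier = Callable[[float], int]   # maps a point to a label in {-1, +1}
Oracle = Callable[[float], int]       # each call returns one label and costs one request
Learner = Callable[[Oracle, Sequence[float], int], Classifier]

def split_budget(n: int) -> Tuple[int, int, int]:
    """Split a label budget n into three parts that sum to n."""
    part = n // 3
    return part, part, n - 2 * part

def select_by_disagreement(h1: Classifier, h2: Classifier, pool: Sequence[float],
                           oracle: Oracle, budget: int) -> Classifier:
    """Request labels only where h1 and h2 disagree; keep the one with fewer mistakes."""
    mistakes1 = mistakes2 = used = 0
    for x in pool:
        if used >= budget:
            break
        if h1(x) != h2(x):
            y = oracle(x)
            used += 1
            mistakes1 += int(h1(x) != y)
            mistakes2 += int(h2(x) != y)
    return h1 if mistakes1 <= mistakes2 else h2

def combined_learner(active_alg: Learner, fallback_alg: Learner, oracle: Oracle,
                     pool: Sequence[float], n: int) -> Classifier:
    """One part for the active learner, one for the fallback, one to choose between them."""
    n_active, n_fallback, n_select = split_budget(n)
    h_active = active_alg(oracle, pool, n_active)
    h_fallback = fallback_alg(oracle, pool, n_fallback)
    return select_by_disagreement(h_active, h_fallback, pool, oracle, n_select)
```

Points on which the two candidates agree carry no information about which one is better, which is why this stand-in spends its share of the budget only on disagreements.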



7. Open Problems 

In some sense, this work raises more questions than it answers. Here, we list several problems that 
remain open at this time. Resolving any of these problems would make a significant contribution to 
our understanding of the fundamental capabilities of active learning. 



• We have established the existence of universal activizers for VC classes in the realizable case.
However, we have not made any serious attempt to characterize the properties that such activizers
can possess. In particular, as mentioned, it would be interesting to know whether
activizers exist that preserve certain favorable properties of the given passive learning algorithm.
For instance, we know that some passive learning algorithms (say, for linear separators)
achieve a label complexity that is independent of the dimensionality of the space $\mathcal{X}$, under
a large margin condition on $f$ and $\mathcal{P}$ (Balcan, Blum, and Vempala, 2006b). Is there an activizer
for such algorithms that preserves this large-margin-based dimension-independence in
the label complexity? Similarly, there are passive algorithms whose label complexity has a
weak dependence on dimensionality, due to sparsity considerations (Bunea, Tsybakov, and
Wegkamp, 2009; Wang and Shen, 2007). Is there an activizer for these algorithms that preserves
this sparsity-based weak dependence on dimension? Is there an activizer that preserves
adaptiveness to the dimension of the manifold to which $\mathcal{P}$ is restricted? What about an activizer
that is sparsistent (Rocha, Wang, and Yu, 2009), given any sparsistent passive learning
algorithm as input? Is there an activizer that preserves admissibility, in that given any admissible
passive learning algorithm, the activized algorithm is an admissible active learning
algorithm? Is there an activizer that, given any minimax optimal passive learning algorithm
as input, produces a minimax optimal active learning algorithm? What about preserving other
notions of optimality, or other properties?



• There may be some waste in the above activizers, since the label requests used in their ini- 
tial phase (reducing the version space) are not used by the passive algorithm to produce the 
final classifier. This guarantees the examples fed into the passive algorithm are conditionally 
independent given the number of examples. Intuitively, this seems necessary for the gen- 
eral results, since any dependence among the examples fed to the passive algorithm could 
influence its label complexity. However, it is not clear (to the author) how dramatic this effect 
can be, nor whether a simpler strategy (e.g., slightly randomizing the budget of label requests) 
might yield a similar effect while allowing a single-stage approach where all labels are used in 
the passive algorithm. It seems intuitively clear that some special types of passive algorithms 
should be able to use the full set of examples, from both phases, while still maintaining the 
strict improvements guaranteed in the main theorems above. What general properties must 
such passive algorithms possess? 
• As previously mentioned, the vast majority of empirically-tested heuristic active learning
algorithms in the published literature are designed in a reduction style, using a well-known
passive learning algorithm as a subroutine, constructing sets of labeled examples and feeding
them into the passive learning algorithm at various points in the execution of the active
learning algorithm (e.g., Abe and Mamitsuka, 1998; McCallum and Nigam, 1998; Schohn and
Cohn, 2000; Campbell, Cristianini, and Smola, 2000; Tong and Koller, 2001; Roy and McCallum,
2001; Muslea, Minton, and Knoblock, 2002; Lindenbaum, Markovitch, and Rusakov,
2004; Mitra, Murthy, and Pal, 2004; Roth and Small, 2004; Schein and Ungar, 2007;
Har-Peled, Roth, and Zimak, 2007; Beygelzimer, Dasgupta, and Langford, 2009). However, rather
than including some examples whose labels are requested and other examples whose labels
are inferred in the sets of labeled examples given to the passive learning algorithm (as in our
rigorous methods above), these heuristic methods typically only input to the passive algorithm
the examples whose labels were requested. We should expect that meta-algorithms of
this type could not be universal activizers, but perhaps there do exist meta-algorithms of this
type that are activizers for every passive learning algorithm of some special type. What are
some general conditions on the passive learning algorithm so that some meta-algorithm of this
type (i.e., feeding in only the requested labels) can activize every passive learning algorithm
satisfying those conditions?

• As discussed earlier, the definition of "activizer" is based on a trade-off between the strength
of claimed improvements for nontrivial scenarios, and ease of analysis within the framework.
There are two natural questions regarding the possibility of stronger notions of "activizer." In
Definition 3, we allow a constant factor $c$ loss in the $\varepsilon$ argument of the label complexity. In
most scenarios, this loss is inconsequential (e.g., typically $\Lambda_p(\varepsilon/c, f, \mathcal{P}) = O(\Lambda_p(\varepsilon, f, \mathcal{P}))$),
but one can construct scenarios where it does make a difference. In our proofs, we see that
it is possible to achieve $c = 3$; in fact, a careful inspection of the proofs reveals we can even
get $c = (1 + o(1))$, a function of $\varepsilon$, converging to 1. However, whether there exist universal
activizers for every VC class that have $c = 1$ remains an open question.

A second question regards our notion of "nontrivial problems." In Definition 3, we have
chosen to think of any target and distribution with label complexity growing faster than
$\mathrm{Polylog}(1/\varepsilon)$ as nontrivial, and do not require the activized algorithm to improve over the
underlying passive algorithm for scenarios that are trivial for the passive algorithm. As mentioned,
Definition 3 does have implications for the label complexities of these problems,
as the label complexity of the activized algorithm will improve over every nontrivial upper
bound on the label complexity of the passive algorithm. However, in order to allow for
various operations in the meta-algorithm that may introduce additive $\mathrm{Polylog}(1/\varepsilon)$ terms due
to exponentially small failure probabilities, such as the test that selects among hypotheses in
ActiveSelect, we do not require the activized algorithm to achieve the same order of label
complexity in trivial scenarios. For instance, there may be cases in which a passive algorithm
achieves $O(1)$ label complexity for a particular $(f, \mathcal{P})$, but its activized counterpart has
$O(\log(1/\varepsilon))$ label complexity. The intention is to define a framework that focuses on nontrivial
scenarios, where passive learning uses prohibitively many labels, rather than one that
requires us to obsess over extra additive logarithmic terms. Nonetheless, there is a question
of whether these losses in the label complexities of trivial problems are necessary to gain the
improvements in the label complexities of nontrivial problems. There is also the question of
how much the definition of "nontrivial" can be relaxed. Specifically, we have the following
question: to what extent can we relax the notion of "nontrivial" in Definition 3, while still
maintaining the existence of universal activizers for VC classes? We see from our proofs that
we can at least replace $\mathrm{Polylog}(1/\varepsilon)$ with $\log(1/\varepsilon)$. However, it is not clear whether we can
go further than this in the realizable case (e.g., to say "nontrivial" means $\omega(1)$). When there is
noise, it is clear that we cannot relax the notion of "nontrivial" beyond replacing $\mathrm{Polylog}(1/\varepsilon)$
with $\log(1/\varepsilon)$. Specifically, whenever $\mathrm{DIS}(\mathbb{C}) \neq \emptyset$, for any label complexity $\Lambda_a$ achieved by
an active learning algorithm, there must be some $\mathcal{P}_{XY}$ with $\Lambda_a(\nu+\varepsilon, \mathcal{P}_{XY}) = \Omega(\log(1/\varepsilon))$,
even with the support of $\mathcal{P}$ restricted to a single point $x \in \mathrm{DIS}(\mathbb{C})$; the proof of this is via a
reduction from sequential hypothesis testing for whether a coin has bias $\alpha$ or $1-\alpha$, for some
$\alpha \in (0, 1/2)$. Since passive learning via empirical risk minimization can achieve label
complexity $\Lambda_p(\nu+\varepsilon, \mathcal{P}_{XY}) = O(\log(1/\varepsilon))$ whenever the support of $\mathcal{P}$ is restricted to a single
point, we cannot further relax the notion of "nontrivial," while preserving the possibility of a
positive outcome for Conjecture 23. It is interesting to note that this entire issue vanishes if
we are only interested in methods that achieve error at most $\varepsilon$ with probability at least $1-\delta$,
where $\delta \in (0,1)$ is some acceptable constant failure probability, as in the work of Balcan,
Hanneke, and Vaughan (2010); in this case, we can simply take "nontrivial" to mean $\omega(1)$ label
complexity, and both Meta-Algorithm 1 and Meta-Algorithm 3 remain universal activizers
under this alternative definition, and achieve $O(1)$ label complexity in trivial scenarios.

• Another interesting question concerns efficiency. Suppose there exists an algorithm to find
an element of $\mathbb{C}$ consistent with any labeled sequence $\mathcal{L}$ in time polynomial in $|\mathcal{L}|$ and $d$,
and that $\mathcal{A}_p(\mathcal{L})$ has running time polynomial in $|\mathcal{L}|$ and $d$. Under these conditions, is there
an activizer for $\mathcal{A}_p$ capable of achieving an error rate smaller than any $\varepsilon$ in running time
polynomial in $1/\varepsilon$ and $d$, given some appropriately large budget $n$? Recall that if we knew
the value of $\tilde{d}_f$ and $\tilde{d}_f \leq c \log d$, then Meta-Algorithm 1 could be made efficient, as discussed
above. Therefore, this question is largely focused on the issue of adapting to the value of $\tilde{d}_f$.
Another related question is whether there is an efficient active learning algorithm achieving
the label complexity bound of Corollary 7 or Corollary 11.

• One question that comes up in the results above is the minimum number of batches of label
requests necessary for a universal activizer. In Meta-Algorithm 0 and Theorem 5, we saw
that sometimes two batches are sufficient: one to reduce the version space, and another to
construct the labeled sample by requesting only those points in the region of disagreement.
We certainly cannot use fewer than two batches in a universal activizer, for any nontrivial
concept space, so that this represents the minimum. However, to get a universal activizer
for every concept space, we increased the number of batches to three in Meta-Algorithm 1.
The question is whether this increase is really necessary. Is there always a universal activizer
using only two batches of label requests, for every VC class $\mathbb{C}$?

• For some $\mathbb{C}$, the learning process in the above methods might be viewed in two components:
one component that performs active learning as usual (say, disagreement-based) under the
assumption that the target function is very simple, and another component that searches for
signs that the target function is in fact more complex. Thus, for some natural classes such
as linear separators, it would be interesting to find simpler, more specialized methods, which
explicitly execute these two components. For instance, for the first component, we might consider
the usual margin-based active learning methods, which query near a current guess of the
separator (Dasgupta, Kalai, and Monteleoni, 2005, 2009; Balcan, Broder, and Zhang, 2007),
except that we bias toward simple hypotheses via a regularization penalty in the optimization
that defines how we update the separator in response to a query. The second component might
then be a simple random search for points whose correct classification requires larger values
of the regularization term.



• Can we construct universal activizers for some concept spaces with infinite VC dimension?
What about under some constraints on the distribution $\mathcal{P}$ or $\mathcal{P}_{XY}$ (e.g., the usual entropy
conditions (van der Vaart and Wellner, 1996))? It seems we can still run Meta-Algorithm 1,
Meta-Algorithm 3, and Algorithm 5 in this case, except we should increase the number of
rounds (values of $k$) as a function of $n$; this may continue to have reasonable behavior even
in some cases where $\tilde{d}_f = \infty$, especially when $\mathcal{P}^k(\partial^k_{\mathbb{C}} f) \to 0$ as $k \to \infty$. However, it is not
clear whether they will continue to guarantee the strict improvements over passive learning
in the realizable case, nor what label complexity guarantees they will achieve. One specific
question is whether there is a method always achieving label complexity $o\left(\varepsilon^{\ldots}\right)$, where
$\rho$ is from the entropy conditions (van der Vaart and Wellner, 1996) and $\kappa$ is from Condition 1.
This would be an improvement over the known results for passive learning (Mammen and
Tsybakov, 1999; Tsybakov, 2004; Koltchinskii, 2006). Another related question is whether
we can improve over the known results for active learning in these scenarios. Specifically,
Hanneke (2011) proved a bound of $\tilde{O}\left(\theta_f\left(\varepsilon^{1/\kappa}\right)\varepsilon^{\ldots}\right)$ on the label complexity of a certain
disagreement-based active learning method, under entropy conditions and Condition 1. Do
there exist active learning methods achieving asymptotically smaller label complexities than
this, in particular improving the $\theta_f\left(\varepsilon^{1/\kappa}\right)$ factor? The quantity $\tilde{\theta}_f\left(\varepsilon^{1/\kappa}\right)$ is no longer defined
when $\tilde{d}_f = \infty$, so this might not be a direct extension of Theorem 27, but we could perhaps
use the sequence of $\theta_f^{(k)}\left(\varepsilon^{1/\kappa}\right)$ values in some other way to replace it in this case.



• There is also a question about generalizing this approach to label spaces other than $\{-1,+1\}$,
and possibly other loss functions. It should be straightforward to extend these results to the
setting of multiclass classification. However, it is not clear what the implications would be
for general structured prediction problems, where the label space may be quite large (even
infinite), and the loss function involves a notion of distance between labels. From a practical
perspective, this question is particularly interesting, since problems with more complicated
label spaces are often the scenarios where active learning is most needed, as it takes substantial
time or effort to label each example. At this time, there are no published theoretical results
on the label complexity improvements achievable for general structured prediction problems.



• All of the claims in this work also hold when Ap is a semi-supervised passive learning al- 
gorithm, simply by withholding a set of unlabeled data points in a preprocessing step, and 
feeding them into the passive algorithm along with the labeled set generated by the activizer. 
However, it is not clear whether further claims are possible when activizing a semi-supervised 
algorithm, for instance by taking into account specific details of the learning bias used by the 
particular semi-supervised algorithm (e.g., a cluster assumption). 
• The splitting index analysis of Dasgupta (2005) has the interesting feature of characterizing a
trade-off between the number of label requests and the number of unlabeled examples used
by the active learning algorithm. In the present work, we do not characterize any such trade-off.
Indeed, the algorithms do not really have any parameter to adjust the number of unlabeled
examples they use (aside from the precision of the $\hat{P}$ estimators), so that they simply use as
many as they need and then halt. This is true in both the realizable case and in the agnostic 
case. It would be interesting to try to modify these algorithms and their analysis so that, 
when there are more unlabeled examples available than would be used by the above methods, 
the algorithms can take advantage of this in a way that can be reflected in improved label 
complexity bounds, and when there are fewer unlabeled examples available, the algorithms 
can alter their behavior to compensate for this, at the cost of an increased label complexity. 
This would be interesting both for the realizable and agnostic cases. In fact, in the agnostic 
case, there are no known methods that exhibit this type of trade-off. 

• Finally, as mentioned in the previous section, there is a serious question concerning what
types of algorithms can be activized in the agnostic case, and how large the improvements in
label complexity will be. In particular, Conjecture 23 hypothesizes that for any VC class, we
can activize some empirical risk minimization algorithm in the agnostic case. Resolving this
conjecture (either positively or negatively) should significantly advance our understanding of
the capabilities of active learning compared to passive learning.
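
As a small numerical companion to the single-point scenario in the discussion of "nontrivial" problems above: when the support is a single point, empirical risk minimization amounts to a majority vote over repeated noisy labels of that point. The sketch below uses Hoeffding's inequality to compute a number of labels sufficient for the vote to be wrong with probability at most $\varepsilon$ when the minority label has probability $\alpha$. It illustrates only the $O(\log(1/\varepsilon))$ upper-bound side and says nothing about the matching lower bound; the function name `labels_sufficient` is hypothetical.

```python
import math

def labels_sufficient(alpha: float, eps: float) -> int:
    """Labels sufficient for a majority vote to err with probability at most eps.

    By Hoeffding's inequality, the empirical frequency of the minority label exceeds
    1/2 with probability at most exp(-2 * n * (1/2 - alpha)^2), so it suffices that
    n >= ln(1/eps) / (2 * (1/2 - alpha)^2).
    """
    if not (0.0 < alpha < 0.5) or not (0.0 < eps < 1.0):
        raise ValueError("need 0 < alpha < 1/2 and 0 < eps < 1")
    gap = 0.5 - alpha
    return math.ceil(math.log(1.0 / eps) / (2.0 * gap * gap))

if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-8):
        print(eps, labels_sufficient(0.25, eps))
```

Squaring $\varepsilon$ doubles the count (up to rounding), which is the logarithmic growth referred to in the text.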

Appendix A. Proofs Related to Section 3: Disagreement-Based Learning

The following result follows from a theorem of Anthony and Bartlett (1999), based on the classic
results of Vapnik (1982) (with slightly better constant factors); see also the work of Blumer,
Ehrenfeucht, Haussler, and Warmuth (1989).

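Lemma 29 For any VC class $\mathbb{C}$, $m \in \mathbb{N}$, and classifier $f$ such that $\forall r > 0$, $\mathrm{B}(f,r) \neq \emptyset$, let
$V_m^{\star} = \{h \in \mathbb{C} : \forall i \leq m, h(X_i) = f(X_i)\}$; for any $\delta \in (0,1)$, there is an event $H_m(\delta)$ with
$\mathbb{P}(H_m(\delta)) \geq 1-\delta$ such that, on $H_m(\delta)$, $V_m^{\star} \subseteq \mathrm{B}(f, \phi(m;\delta))$, where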

(Am; <5 = 2 ^ o 

m 

A fact we will use repeatedly is that, for any $N(\varepsilon) = \omega(\log(1/\varepsilon))$, we have $\phi(N(\varepsilon);\varepsilon) = o(1)$.

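Lemma 30 For $\hat{P}_n(\mathrm{DIS}(V))$ from (1), on an event $J_n$ with $\mathbb{P}(J_n) \geq 1 - 2\cdot\exp\{-n/4\}$,

\[ \max\left\{\mathcal{P}(\mathrm{DIS}(V)), 4/n\right\} \leq \hat{P}_n(\mathrm{DIS}(V)) \leq \max\left\{4\mathcal{P}(\mathrm{DIS}(V)), 8/n\right\}. \] ⋄

Proof Note that the sequence $\mathcal{U}_n$ from (1) is independent from both $V$ and $\mathcal{L}$. By a Chernoff bound,
on an event $J_n$ with $\mathbb{P}(J_n) \geq 1 - 2\cdot\exp\{-n/4\}$,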

V{mS{V)) > 2/n =^ , ^(DIS(^)) g [1/2^2], 
andP(DIS(y)) < 2/n ^ 5] lDis(y)(^) < 
This immediately implies the stated result.



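Lemma 31 Let $\lambda : (0,1) \to (0,\infty)$ and $L : \mathbb{N} \times (0,1) \to [0,\infty)$ be such that $\lambda(\varepsilon) = \omega(1)$,
$L(n,\varepsilon)$ is 0 at $n = 1$ and is diverging as $n \to \infty$ for every $\varepsilon \in (0,1)$, and for any $\mathbb{N}$-valued
$N(\varepsilon) = \omega(\lambda(\varepsilon))$, $L(N(\varepsilon),\varepsilon) = \omega(N(\varepsilon))$. Let $L^{-1}(m;\varepsilon) = \max\{n \in \mathbb{N} : L(n,\varepsilon) < m\}$, for any
$m \in (0,\infty)$. Then for any $\Lambda(\varepsilon) = \omega(\lambda(\varepsilon))$, $L^{-1}(\Lambda(\varepsilon);\varepsilon) = o(\Lambda(\varepsilon))$. ⋄

Proof First note that $L^{-1}$ is well-defined and finite, due to the facts that $L(n,\varepsilon)$ can be 0 and is
diverging in $n$. Let $\Lambda(\varepsilon) = \omega(\lambda(\varepsilon))$. It is fairly straightforward to show $L^{-1}(\Lambda(\varepsilon);\varepsilon) \neq \Omega(\Lambda(\varepsilon))$,
but the stronger $o(\Lambda(\varepsilon))$ result takes slightly more work. Let $\bar{L}(n,\varepsilon) = \min\left\{L(n,\varepsilon), n^2/\lambda(\varepsilon)\right\}$
for every $n \in \mathbb{N}$ and $\varepsilon \in (0,1)$, and let $\bar{L}^{-1}(m;\varepsilon) = \max\left\{n \in \mathbb{N} : \bar{L}(n,\varepsilon) < m\right\}$. We will first
prove the result for $\bar{L}$.

Note that by definition of $\bar{L}^{-1}$, we know

\[ \left(\bar{L}^{-1}(\Lambda(\varepsilon);\varepsilon) + 1\right)^2 / \lambda(\varepsilon) \geq \bar{L}\left(\bar{L}^{-1}(\Lambda(\varepsilon);\varepsilon) + 1, \varepsilon\right) \geq \Lambda(\varepsilon) = \omega(\lambda(\varepsilon)), \]

which implies $\bar{L}^{-1}(\Lambda(\varepsilon);\varepsilon) = \omega(\lambda(\varepsilon))$. But, by definition of $\bar{L}^{-1}$ and the condition on $L$,

\[ \Lambda(\varepsilon) > \bar{L}\left(\bar{L}^{-1}(\Lambda(\varepsilon);\varepsilon), \varepsilon\right) = \omega\left(\bar{L}^{-1}(\Lambda(\varepsilon);\varepsilon)\right). \]

Since $\bar{L}^{-1}(m;\varepsilon) \geq L^{-1}(m;\varepsilon)$ for all $m$, this implies $\Lambda(\varepsilon) = \omega\left(L^{-1}(\Lambda(\varepsilon);\varepsilon)\right)$, or equivalently
$L^{-1}(\Lambda(\varepsilon);\varepsilon) = o(\Lambda(\varepsilon))$. ■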
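
The generalized inverse in Lemma 31 is easy to experiment with numerically. The sketch below is illustrative only: the particular function $L(n,\varepsilon) = (n-1)\log_2 n$ is a hypothetical choice of ours (it is 0 at $n = 1$, diverges, and grows superlinearly), it ignores $\varepsilon$, and the search assumes $L$ is nondecreasing in $n$. It computes $L^{-1}(m;\varepsilon) = \max\{n \in \mathbb{N} : L(n,\varepsilon) < m\}$ and lets one observe that $L^{-1}(m;\varepsilon)/m$ decreases as $m$ grows, which is the finite-sample shadow of the $o(\cdot)$ conclusion.

```python
import math

def L(n: int, eps: float) -> float:
    """Hypothetical example with L(1, eps) = 0 and L(n, eps) / n -> infinity."""
    return (n - 1) * math.log2(n)

def L_inv(m: float, eps: float) -> int:
    """max{n in N : L(n, eps) < m} for m > 0, assuming L is nondecreasing in n."""
    hi = 1
    while L(hi, eps) < m:      # doubling search for some hi with L(hi, eps) >= m
        hi *= 2
    lo = max(1, hi // 2)       # the loop above guarantees L(lo, eps) < m
    while hi - lo > 1:         # bisection keeps L(lo, eps) < m <= L(hi, eps)
        mid = (lo + hi) // 2
        if L(mid, eps) < m:
            lo = mid
        else:
            hi = mid
    return lo

if __name__ == "__main__":
    for m in (1e2, 1e4, 1e6):
        print(m, L_inv(m, 0.1), L_inv(m, 0.1) / m)
```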



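Lemma 32 For any VC class $\mathbb{C}$ and passive algorithm $\mathcal{A}_p$, if $\mathcal{A}_p$ achieves label complexity $\Lambda_p$,
then Meta-Algorithm 0, with $\mathcal{A}_p$ as its argument, achieves a label complexity $\Lambda_a$ such that, for
every $f \in \mathbb{C}$ and distribution $\mathcal{P}$ over $\mathcal{X}$, if $\mathcal{P}(\partial_{\mathbb{C},\mathcal{P}} f) = 0$ and $\infty > \Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon))$,
then $\Lambda_a(2\varepsilon,f,\mathcal{P}) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$. ⋄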

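Proof This proof follows similar lines to a proof of a related result of Balcan, Hanneke, and
Vaughan (2010). Suppose $\mathcal{A}_p$ achieves a label complexity $\Lambda_p$, and that $f \in \mathbb{C}$ and distribution
$\mathcal{P}$ satisfy $\infty > \Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon))$ and $\mathcal{P}(\partial_{\mathbb{C},\mathcal{P}} f) = 0$. Let $\varepsilon \in (0,1)$. For
$n \in \mathbb{N}$, let $\Delta_n(\varepsilon) = \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, \phi(\lfloor n/2 \rfloor; \varepsilon/2))))$, $L(n;\varepsilon) = \lfloor n/\max\{32/n, 16\Delta_n(\varepsilon)\} \rfloor$, and
for $m \in (0,\infty)$ let $L^{-1}(m;\varepsilon) = \max\{n \in \mathbb{N} : L(n;\varepsilon) < m\}$. Suppose

\[ n \geq \max\left\{12\ln(6/\varepsilon),\ 1 + L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon)\right\}. \]

Consider running Meta-Algorithm 0 with $\mathcal{A}_p$ and $n$ as arguments, while $f$ is the target function and
$\mathcal{P}$ is the data distribution. Let $V$ and $\mathcal{L}$ be as in Meta-Algorithm 0, and let $\hat{h}_n = \mathcal{A}_p(\mathcal{L})$ denote the
classifier returned at the end.

By Lemma 29, on the event $H_{\lfloor n/2 \rfloor}(\varepsilon/2)$, $V \subseteq \mathrm{B}(f, \phi(\lfloor n/2 \rfloor; \varepsilon/2))$, so that $\mathcal{P}(\mathrm{DIS}(V)) \leq
\Delta_n(\varepsilon)$. Letting $U = \{X_{\lfloor n/2 \rfloor + 1}, \ldots, X_{\lfloor n/2 \rfloor + \lfloor n/(4\hat{\Delta}) \rfloor}\}$, by Lemma 30, on $H_{\lfloor n/2 \rfloor}(\varepsilon/2) \cap J_n$ we
have

\[ \lfloor n/\max\{32/n, 16\Delta_n(\varepsilon)\} \rfloor \leq |U| \leq \lfloor n/\max\{4\mathcal{P}(\mathrm{DIS}(V)), 16/n\} \rfloor. \qquad (8) \]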
By a Chernoff bound, for an event $K_n$ with $\mathbb{P}(K_n) \geq 1 - \exp\{-n/12\}$, on $H_{\lfloor n/2 \rfloor}(\varepsilon/2) \cap J_n \cap K_n$,
$|U \cap \mathrm{DIS}(V)| \leq 2\mathcal{P}(\mathrm{DIS}(V)) \cdot \lfloor n/\max\{4\mathcal{P}(\mathrm{DIS}(V)), 16/n\} \rfloor \leq \lceil n/2 \rceil$. Defining the event
$G_n(\varepsilon) = H_{\lfloor n/2 \rfloor}(\varepsilon/2) \cap J_n \cap K_n$, we see that on $G_n(\varepsilon)$, every time $X_m \in \mathrm{DIS}(V)$ in Step 5 of
Meta-Algorithm 0, we have $t < n$; therefore, since $f \in V$ implies that the inferred labels in Step 6
are correct as well, we have that on $G_n(\varepsilon)$,

\[ \forall (x, \hat{y}) \in \mathcal{L},\ \hat{y} = f(x). \qquad (9) \]


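Noting that

\[ \mathbb{P}\left(G_n(\varepsilon)^c\right) \leq \mathbb{P}\left(H_{\lfloor n/2 \rfloor}(\varepsilon/2)^c\right) + \mathbb{P}\left(J_n^c\right) + \mathbb{P}\left(K_n^c\right) \leq \varepsilon/2 + 2\cdot\exp\{-n/4\} + \exp\{-n/12\} \leq \varepsilon, \]

we have

\[ \mathbb{E}\left[\mathrm{er}\left(\hat{h}_n\right)\right] \leq \mathbb{E}\left[\mathbf{1}_{G_n(\varepsilon)} \mathbf{1}\left[|\mathcal{L}| \geq \Lambda_p(\varepsilon,f,\mathcal{P})\right] \mathrm{er}\left(\hat{h}_n\right)\right] + \mathbb{P}\left(G_n(\varepsilon) \cap \{|\mathcal{L}| < \Lambda_p(\varepsilon,f,\mathcal{P})\}\right) + \mathbb{P}\left(G_n(\varepsilon)^c\right) \]
\[ \leq \mathbb{E}\left[\mathbf{1}_{G_n(\varepsilon)} \mathbf{1}\left[|\mathcal{L}| \geq \Lambda_p(\varepsilon,f,\mathcal{P})\right] \mathrm{er}\left(\mathcal{A}_p(\mathcal{L})\right)\right] + \mathbb{P}\left(G_n(\varepsilon) \cap \{|\mathcal{L}| < \Lambda_p(\varepsilon,f,\mathcal{P})\}\right) + \varepsilon. \qquad (10) \]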

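On $G_n(\varepsilon)$, (8) implies $|\mathcal{L}| \geq L(n;\varepsilon)$, and we chose $n$ large enough so that $L(n;\varepsilon) \geq \Lambda_p(\varepsilon,f,\mathcal{P})$.
Thus, the second term in (10) is zero, and we have

\[ \mathbb{E}\left[\mathrm{er}\left(\hat{h}_n\right)\right] \leq \mathbb{E}\left[\mathbf{1}_{G_n(\varepsilon)} \mathbf{1}\left[|\mathcal{L}| \geq \Lambda_p(\varepsilon,f,\mathcal{P})\right] \mathrm{er}\left(\mathcal{A}_p(\mathcal{L})\right)\right] + \varepsilon \]
\[ = \mathbb{E}\left[\mathbb{E}\left[\mathbf{1}_{G_n(\varepsilon)} \mathrm{er}\left(\mathcal{A}_p(\mathcal{L})\right) \,\middle|\, |\mathcal{L}|\right] \mathbf{1}\left[|\mathcal{L}| \geq \Lambda_p(\varepsilon,f,\mathcal{P})\right]\right] + \varepsilon. \qquad (11) \]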



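For any $\ell \in \mathbb{N}$ with $\mathbb{P}(|\mathcal{L}| = \ell) > 0$, the conditional of $\mathcal{U} \mid \{|\mathcal{U}| = \ell\}$ is a product distribution $\mathcal{P}^\ell$; that is, the samples in $\mathcal{U}$ are conditionally independent and identically distributed with distribution $\mathcal{P}$, which is the same as the distribution of $\{X_1, X_2, \ldots, X_\ell\}$. Therefore, for any such $\ell$ with $\ell \ge \Lambda_p(\varepsilon,f,\mathcal{P})$, by (9) we have

$$\mathbb{E}\left[1_{G_n(\varepsilon)}\, \mathrm{er}(\mathcal{A}_p(\mathcal{L})) \,\middle|\, \{|\mathcal{L}| = \ell\}\right] \le \mathbb{E}[\mathrm{er}(\mathcal{A}_p(\mathcal{Z}_\ell))] \le \varepsilon.$$

In particular, this means (11) is at most $2\varepsilon$. This implies Meta-Algorithm 0, with $\mathcal{A}_p$ as its argument, achieves a label complexity $\Lambda_a$ such that

$$\Lambda_a(2\varepsilon,f,\mathcal{P}) \le \max\left\{12\ln(6/\varepsilon),\ 1 + L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon)\right\}.$$

Since $\Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon)) \Rightarrow 12\ln(6/\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$, it remains only to show that $L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$. Note that $\forall \varepsilon \in (0,1)$, $L(1;\varepsilon) = 0$ and $L(n;\varepsilon)$ is diverging in $n$. Furthermore, by the assumption $\mathcal{P}(\partial_{\mathbb{C},\mathcal{P}} f) = 0$, we know that for any $N(\varepsilon) = \omega(\log(1/\varepsilon))$, we have $\Delta_{N(\varepsilon)}(\varepsilon) = o(1)$ (by continuity of probability measures), which implies $L(N(\varepsilon);\varepsilon) = \omega(N(\varepsilon))$. Thus, since $\Lambda_p(\varepsilon,f,\mathcal{P}) = \omega(\log(1/\varepsilon))$, Lemma 31 implies $L^{-1}(\Lambda_p(\varepsilon,f,\mathcal{P});\varepsilon) = o(\Lambda_p(\varepsilon,f,\mathcal{P}))$, as desired. ■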
Lemma 33 For any VC class $\mathbb{C}$, target function $f \in \mathbb{C}$, and distribution $\mathcal{P}$, if $\mathcal{P}(\partial_{\mathbb{C},\mathcal{P}} f) > 0$, then there exists a passive learning algorithm $\mathcal{A}_p$ achieving a label complexity $\Lambda_p$ such that $(f,\mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$, and for any label complexity $\Lambda_a$ achieved by running Meta-Algorithm 0 with $\mathcal{A}_p$ as its argument, and any constant $c \in (0,\infty)$,

$$\Lambda_a(c\varepsilon,f,\mathcal{P}) \ne o(\Lambda_p(\varepsilon,f,\mathcal{P})). \qquad \diamond$$



Proof The proof can be broken down into three essential claims. First, it follows from Lemma 35 below that, on an event $H'$ of probability one, $\mathcal{P}(\partial_V f) \ge \mathcal{P}(\partial_{\mathbb{C}} f)$; since $\mathcal{P}(\mathrm{DIS}(V)) \ge \mathcal{P}(\partial_V f)$, we have $\mathcal{P}(\mathrm{DIS}(V)) \ge \mathcal{P}(\partial_{\mathbb{C}} f)$ on $H'$.

The second claim is that on $H' \cap J_n$, $|\mathcal{L}| = O(n)$. This follows from Lemma 30 and our first claim by noting that, on $H' \cap J_n$, $|\mathcal{L}| = n/(4\hat{\Delta}) \le n/(4\mathcal{P}(\mathrm{DIS}(V))) \le n/(4\mathcal{P}(\partial_{\mathbb{C}} f))$.

Finally, we construct a passive algorithm $\mathcal{A}_p$ whose label complexity is not significantly improved when $|\mathcal{L}| = O(n)$. There is a fairly obvious randomized $\mathcal{A}_p$ with this property (simply returning $-f$ with probability $1/|\mathcal{L}|$, and otherwise $f$); however, we can even satisfy the property with a deterministic $\mathcal{A}_p$, as follows. Let $\mathcal{H}_f = \{h_i\}_{i=1}^{\infty}$ be any sequence of classifiers (not necessarily in $\mathbb{C}$) with $0 < \mathcal{P}(x : h_i(x) \ne f(x))$ strictly decreasing to 0 (say with $h_1 = -f$). We know such a sequence must exist since $\mathcal{P}(\partial_{\mathbb{C}} f) > 0$. Now define, for nonempty $S$,

$$\mathcal{A}_p(S) = \mathop{\mathrm{argmin}}_{h_i \in \mathcal{H}_f}\ \mathcal{P}(x : h_i(x) \ne f(x)) + 2 \cdot 1_{[0,1/|S|)}\left(\mathcal{P}(x : h_i(x) \ne f(x))\right).$$

$\mathcal{A}_p$ is constructed so that, in the special case that this particular $f$ is the target function and this particular $\mathcal{P}$ is the data distribution, $\mathcal{A}_p(S)$ returns the $h_i \in \mathcal{H}_f$ with minimal $\mathrm{er}(h_i)$ such that $\mathrm{er}(h_i) \ge 1/|S|$. For completeness, let $\mathcal{A}_p(\emptyset) = h_1$. Define $\varepsilon_i = \mathrm{er}(h_i) = \mathcal{P}(x : h_i(x) \ne f(x))$.
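
The selection rule of this deterministic $\mathcal{A}_p$ can be sketched as follows. This is an illustration under assumptions: the infinite sequence $\{h_i\}$ is truncated to a finite list of error rates, and the name `passive_choice` is hypothetical.

```python
def passive_choice(errors, sample_size):
    """Index i minimizing errors[i] + 2 * 1[errors[i] < 1/sample_size].

    `errors` is a finite, strictly decreasing list standing in for the
    error rates eps_i = er(h_i); the truncation is an assumption of this
    sketch (the construction in the text uses an infinite sequence).
    """
    if sample_size == 0:
        return 0  # A_p(empty set) = h_1
    threshold = 1.0 / sample_size

    def objective(i):
        penalty = 2.0 if errors[i] < threshold else 0.0
        return errors[i] + penalty

    return min(range(len(errors)), key=objective)
```

The rule ignores the labels entirely and depends only on the sample size, so the returned error rate is never below $1/|S|$; that is why a label budget yielding only $O(n)$ examples cannot improve on it by more than a constant factor.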

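Now let $\hat{h}_n$ be the returned classifier from running Meta-Algorithm 0 with $\mathcal{A}_p$ and $n$ as inputs, let $\Lambda_p$ be the (minimal) label complexity achieved by $\mathcal{A}_p$, and let $\Lambda_a$ be the (minimal) label complexity achieved by Meta-Algorithm 0 with $\mathcal{A}_p$ as input. Take any $c \in (0,\infty)$, and $i$ sufficiently large so that $\varepsilon_{i-1} < 1/2$. Then we know that for any $\varepsilon \in [\varepsilon_i, \varepsilon_{i-1})$, $\Lambda_p(\varepsilon,f,\mathcal{P}) = \lceil 1/\varepsilon_i \rceil$. In particular, $\Lambda_p(\varepsilon,f,\mathcal{P}) \ge 1/\varepsilon$, so that $(f,\mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$. Also, by Markov's inequality and the above results on $|\mathcal{L}|$,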



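$$\mathbb{E}[\mathrm{er}(\hat{h}_n)] \ge \mathbb{E}\left[\frac{1}{|\mathcal{L}|}\right] \ge \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n}\, \mathbb{P}(H' \cap J_n) \ge \frac{4\mathcal{P}(\partial_{\mathbb{C}} f)}{n}\left(1 - 2\exp\{-n/4\}\right).$$

This implies that for $4\ln(4) < n < \frac{2\mathcal{P}(\partial_{\mathbb{C}} f)}{c\varepsilon_i}$, we have $\mathbb{E}[\mathrm{er}(\hat{h}_n)] > c\varepsilon_i$, so that for all sufficiently large $i$,

$$\Lambda_a(c\varepsilon_i,f,\mathcal{P}) \ge \frac{2\mathcal{P}(\partial_{\mathbb{C}} f)}{c\varepsilon_i} \ge \frac{\mathcal{P}(\partial_{\mathbb{C}} f)}{c} \left\lceil \frac{1}{\varepsilon_i} \right\rceil = \frac{\mathcal{P}(\partial_{\mathbb{C}} f)}{c}\, \Lambda_p(\varepsilon_i,f,\mathcal{P}).$$

Since this happens for all sufficiently large $i$, and thus for arbitrarily small $\varepsilon_i$ values, we have

$$\Lambda_a(c\varepsilon,f,\mathcal{P}) \ne o(\Lambda_p(\varepsilon,f,\mathcal{P})).$$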
Proof [Theorem 5] Theorem 5 now follows directly from Lemmas 32 and 33, corresponding to the "if" and "only if" parts of the claim, respectively.



Appendix B. Proofs Related to Section 5: Basic Activizer

In this section, we provide detailed definitions, lemmas and proofs related to Meta-Algorithm 1.

In fact, we will develop slightly more general results here. Specifically, we fix an arbitrary constant $\gamma \in (0,1)$, and will prove the result for a family of meta-algorithms parameterized by the value $\gamma$, used as the threshold in Steps 3 and 6 of Meta-Algorithm 1, which were set to 1/2 above to simplify the algorithm. Thus, setting $\gamma = 1/2$ in the statements below will give the stated theorem.

Throughout this section, we will assume $\mathbb{C}$ is a VC class with VC dimension $d$, and let $\mathcal{P}$ denote the (arbitrary) marginal distribution of $X_i$ ($\forall i$). We also fix an arbitrary classifier $f \in \mathrm{cl}(\mathbb{C})$, where (as in Section 6) $\mathrm{cl}(\mathbb{C}) = \{h : \forall r > 0, \mathrm{B}(h,r) \ne \emptyset\}$ denotes the closure of $\mathbb{C}$. In the present context, $f$ corresponds to the target function when running Meta-Algorithm 1. Thus, we will study the behavior of Meta-Algorithm 1 for this fixed $f$ and $\mathcal{P}$; since they are chosen arbitrarily, to establish Theorem 6 it will suffice to prove that for any passive $\mathcal{A}_p$, Meta-Algorithm 1 with $\mathcal{A}_p$ as input achieves superior label complexity compared to $\mathcal{A}_p$ for this $f$ and $\mathcal{P}$. In fact, because here we only assume $f \in \mathrm{cl}(\mathbb{C})$ (rather than $f \in \mathbb{C}$), we actually end up proving a slightly more general version of Theorem 6. But more importantly, this relaxation to $\mathrm{cl}(\mathbb{C})$ will also make the lemmas developed below more useful for subsequent proofs: namely, those in Appendix E.2. For this same reason, many of the lemmas of this section are substantially more general than is necessary for the proof of Theorem 6; the more general versions will be used in the proofs of results in later sections.

For any $m \in \mathbb{N}$, we define $V^\star_m = \{h \in \mathbb{C} : \forall i \le m, h(X_i) = f(X_i)\}$. Additionally, for $\mathcal{H} \subseteq \mathbb{C}$, and an integer $k \ge 0$, we will adopt the notation

$$\mathcal{S}^k(\mathcal{H}) = \{S \in \mathcal{X}^k : \mathcal{H} \text{ shatters } S\}, \qquad \bar{\mathcal{S}}^k(\mathcal{H}) = \mathcal{X}^k \setminus \mathcal{S}^k(\mathcal{H}),$$

and as in Section 5, we define the $k$-dimensional shatter core of $f$ with respect to $\mathcal{H}$ (and $\mathcal{P}$) as

$$\partial^k_{\mathcal{H}} f = \lim_{r \to 0} \mathcal{S}^k(\mathrm{B}_{\mathcal{H}}(f,r)),$$

and further define

$$\bar{\partial}^k_{\mathcal{H}} f = \mathcal{X}^k \setminus \partial^k_{\mathcal{H}} f.$$

Also as in Section 5, define

$$\tilde{d}_f = \min\{k \in \mathbb{N} : \mathcal{P}^k(\partial^k_{\mathbb{C}} f) = 0\}.$$

For convenience, we also define the abbreviation

$$\tilde{\delta}_f = \mathcal{P}^{\tilde{d}_f - 1}\left(\partial^{\tilde{d}_f - 1}_{\mathbb{C}} f\right).$$
Also, recall that we are using the convention that $\mathcal{X}^0 = \{\emptyset\}$, $\mathcal{P}^0(\mathcal{X}^0) = 1$, and we say a set of classifiers $\mathcal{H}$ shatters $\emptyset$ iff $\mathcal{H} \ne \{\}$. In particular, $\mathcal{S}^0(\mathcal{H}) \ne \{\}$ iff $\mathcal{H} \ne \{\}$, and $\partial^0_{\mathcal{H}} f \ne \{\}$ iff $\inf_{h \in \mathcal{H}} \mathcal{P}(x : h(x) \ne f(x)) = 0$. For any measurable sets $S_1, S_2 \subseteq \mathcal{X}^k$ with $\mathcal{P}^k(S_2) > 0$, as usual we define $\mathcal{P}^k(S_1 | S_2) = \mathcal{P}^k(S_1 \cap S_2)/\mathcal{P}^k(S_2)$; in the situation where $\mathcal{P}^k(S_2) = 0$, it will be convenient to define $\mathcal{P}^k(S_1 | S_2) = 0$. We use the definition of $\mathrm{er}(h)$ from above, and additionally define the conditional error rate $\mathrm{er}(h|S) = \mathcal{P}(\{x : h(x) \ne f(x)\} | S)$ for any measurable $S \subseteq \mathcal{X}$. We also adopt the usual short-hand for equalities and inequalities involving conditional expectations and probabilities given random variables, wherein for instance, we write $\mathbb{E}[X|Y] = Z$ to mean that there is a version of $\mathbb{E}[X|Y]$ that is everywhere equal to $Z$, so that in particular, any version of $\mathbb{E}[X|Y]$ equals $Z$ almost everywhere (see e.g., Ash and Doléans-Dade, 2000).
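
For a finite set of classifiers, the shattering predicate behind $\mathcal{S}^k(\mathcal{H})$ can be checked by brute force. The sketch below is an illustration only; the threshold classifiers are a hypothetical example class, not one used in the paper.

```python
def shatters(hypotheses, points):
    """True iff every labeling in {-1,+1}^k of `points` is realized.

    `hypotheses` is a finite iterable of callables x -> {-1, +1}.
    With `points` empty, a nonempty class realizes the single empty
    labeling, matching the convention that H shatters the empty set
    iff H is nonempty.
    """
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

def threshold(t):
    """Hypothetical example classifier: +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1
```

For example, two thresholds at 0.3 and 0.5 shatter the single point 0.4 but not the pair (0.4, 0.45), since only two of the four labelings are realized.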



B.1 Definition of Estimators for Meta-Algorithm 1

While the estimated probabilities used in Meta-Algorithm 1 can be defined in a variety of ways to make it a universal activizer, in the statement of Theorem 6 above and proof thereof below, we take the following specific definitions. After the definition, we discuss alternative possibilities.

Though it is a slight twist on the formal model, it will greatly simplify our discussion below to suppose we have access to two independent sequences of i.i.d. unlabeled examples $W_1 = \{w_1, w_2, \ldots\}$ and $W_2 = \{w'_1, w'_2, \ldots\}$, also independent from the main sequence $\{X_1, X_2, \ldots\}$, with $w_i, w'_i \sim \mathcal{P}$. Since the data sequence $\{X_1, X_2, \ldots\}$ is i.i.d., this is distributionally equivalent to supposing we partition the data sequence in a preprocessing step, into three subsequences, alternatingly assigning each data point to either $\mathcal{Z}'_X$, $W_1$, or $W_2$. Then, if we suppose $\mathcal{Z}'_X = \{X'_1, X'_2, \ldots\}$, and we replace all references to $X_i$ with $X'_i$ in the algorithms and results, we obtain the equivalent statements holding for the model as originally stated. Thus, supposing the existence of these $W_i$ sequences simply serves to simplify notation, and does not represent a further assumption on top of the previously stated framework.
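
The preprocessing step described above, assigning data points alternately to three subsequences, can be sketched as follows. The function name `split_three` is hypothetical, and the sketch operates on a finite prefix of the stream.

```python
def split_three(stream):
    """Round-robin split of a finite prefix of an i.i.d. stream.

    Returns (main, w1, w2): points 0, 3, 6, ... go to the main sequence,
    points 1, 4, 7, ... to W_1, and points 2, 5, 8, ... to W_2.
    """
    data = list(stream)
    return data[0::3], data[1::3], data[2::3]
```

Because the assignment depends only on position and never on the values, each of the three subsequences is itself i.i.d. and the three are mutually independent.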

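For each $k \ge 2$, we partition $W_2$ into subsets of size $k - 1$, as follows. For $i \in \mathbb{N}$, let

$$S_i^{(k)} = \{w'_{1+(i-1)(k-1)}, \ldots, w'_{i(k-1)}\}.$$

We define the $\hat{P}_m$ estimators in terms of three types of functions, defined below. For any $\mathcal{H} \subseteq \mathbb{C}$, $x \in \mathcal{X}$, $y \in \{-1,+1\}$, $m \in \mathbb{N}$, we define

$$\hat{P}_m\left(S \in \mathcal{X}^{k-1} : \mathcal{H} \text{ shatters } S \cup \{x\} \,\middle|\, \mathcal{H} \text{ shatters } S\right) = \hat{\Delta}_m^{(k)}(x, W_2, \mathcal{H}), \qquad (12)$$

$$\hat{P}_m\left(S \in \mathcal{X}^{k-1} : \mathcal{H}[(x,y)] \text{ does not shatter } S \,\middle|\, \mathcal{H} \text{ shatters } S\right) = \hat{\Gamma}_m^{(k)}(x, y, W_2, \mathcal{H}), \qquad (13)$$

$$\hat{P}_m\left(x : \hat{P}\left(S \in \mathcal{X}^{k-1} : \mathcal{H} \text{ shatters } S \cup \{x\} \,\middle|\, \mathcal{H} \text{ shatters } S\right) \ge \gamma\right) = \hat{\Delta}_m^{(k)}(W_1, W_2, \mathcal{H}). \qquad (14)$$

The quantities $\hat{\Delta}_m^{(k)}(x, W_2, \mathcal{H})$, $\hat{\Gamma}_m^{(k)}(x, y, W_2, \mathcal{H})$, and $\hat{\Delta}_m^{(k)}(W_1, W_2, \mathcal{H})$ are specified as follows.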

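For $k = 1$, $\hat{\Gamma}_m^{(1)}(x, y, W_2, \mathcal{H})$ is simply an indicator for whether every $h \in \mathcal{H}$ has $h(x) = y$, while $\hat{\Delta}_m^{(1)}(x, W_2, \mathcal{H})$ is an indicator for whether $x \in \mathrm{DIS}(\mathcal{H})$. Formally, they are defined as follows.

$$\hat{\Gamma}_m^{(1)}(x, y, W_2, \mathcal{H}) = 1_{\bigcap_{h \in \mathcal{H}} \{h(x)\}}(y).$$

$$\hat{\Delta}_m^{(1)}(x, W_2, \mathcal{H}) = 1_{\mathrm{DIS}(\mathcal{H})}(x).$$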
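
For a finite, nonempty class, the two $k = 1$ indicators can be computed directly. The sketch below is illustrative: the function names and the threshold classifiers are hypothetical, and the unused arguments $m$ and $W_2$ are dropped.

```python
def gamma_hat_1(x, y, hypotheses):
    """1 iff every classifier in the (nonempty) class labels x as y."""
    return 1 if all(h(x) == y for h in hypotheses) else 0

def delta_hat_1(x, hypotheses):
    """1 iff x lies in the region of disagreement DIS(H)."""
    labels = {h(x) for h in hypotheses}
    return 1 if len(labels) > 1 else 0

def threshold(t):
    """Hypothetical example classifier: +1 iff x >= t."""
    return lambda x: 1 if x >= t else -1
```

With thresholds at 0.3 and 0.5, the point 0.4 is in the region of disagreement, so neither label is agreed upon there, whereas at 0.9 both classifiers predict +1.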
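For $k \ge 2$, we first define

$$M_m^{(k)}(\mathcal{H}) = \max\left\{1, \sum_{i=1}^{m^3} 1_{\mathcal{S}^{k-1}(\mathcal{H})}\left(S_i^{(k)}\right)\right\}.$$

Then we take the following definitions for $\hat{\Gamma}^{(k)}$ and $\hat{\Delta}^{(k)}$.

$$\hat{\Gamma}_m^{(k)}(x, y, W_2, \mathcal{H}) = \frac{1}{M_m^{(k)}(\mathcal{H})} \sum_{i=1}^{m^3} 1_{\bar{\mathcal{S}}^{k-1}(\mathcal{H}[(x,y)])}\left(S_i^{(k)}\right) 1_{\mathcal{S}^{k-1}(\mathcal{H})}\left(S_i^{(k)}\right). \qquad (15)$$

$$\hat{\Delta}_m^{(k)}(x, W_2, \mathcal{H}) = \frac{1}{M_m^{(k)}(\mathcal{H})} \sum_{i=1}^{m^3} 1_{\mathcal{S}^{k}(\mathcal{H})}\left(S_i^{(k)} \cup \{x\}\right). \qquad (16)$$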

For the remaining estimator, for any k we generally define 

i=l 



The above definitions will be used in the proofs below. However, there are certainly viable al- 
ternative definitions one can consider, some of which may have interesting theoretical properties. In 
general, one has the same sorts of trade-offs present whenever estimating a conditional probability. 

For instance, we could replace "$m^3$" in (15) and (16) by $\min\{\ell \in \mathbb{N} : M_\ell^{(k)}(\mathcal{H}) = m^3\}$, and then normalize by $m^3$ instead of $M_m^{(k)}(\mathcal{H})$; this would give us $m^3$ samples from the conditional distribution with which to estimate the conditional probability. The advantages of this approach would be its simplicity or elegance, and possibly some improvement in the constant factors in the label complexity bounds below. On the other hand, the drawback of this alternative definition would be that we do not know a priori how many unlabeled samples we will need to process in order to calculate it; indeed, for some values of $k$ and $\mathcal{H}$, we expect $\mathcal{P}^{k-1}(\mathcal{S}^{k-1}(\mathcal{H})) = 0$, so that $M_\ell^{(k)}(\mathcal{H})$ is bounded, and we might technically need to examine the entire sequence to distinguish this case from the case of very small $\mathcal{P}^{k-1}(\mathcal{S}^{k-1}(\mathcal{H}))$. Of course, these practical issues can be addressed with small modifications, but only at the expense of complicating the analysis, thus losing the elegance factor. For these reasons, we have opted for the slightly looser and less elegant, but more practical, definitions above in (15) and (16).



B.2 Proof of Theorem 6

At a high level, the structure of the proof is the following. The primary components of the proof are three lemmas: 34, 37, and 38. Setting aside, for a moment, the fact that we are using the $\hat{P}_m$ estimators rather than the actual probability values they estimate, Lemma 38 indicates that the number of data points in $\mathcal{L}_{\tilde{d}_f}$ grows superlinearly in $n$ (the number of label requests), while Lemma 37 guarantees that the labels of these points are correct, and Lemma 34 tells us that the classifier returned in the end is never much worse than $\mathcal{A}_p(\mathcal{L}_{\tilde{d}_f})$. These three factors combine to prove the result. The rest of the proof is composed of supporting lemmas and details regarding the $\hat{P}_m$ estimators. Specifically, Lemmas 35 and 36 serve a supporting role, with the purpose of showing that the set of $V$-shatterable $k$-tuples converges to the $k$-dimensional shatter core (up to probability-zero differences). The other lemmas below (39–45) are needed primarily to extend the above basic idea to the actual scenario where the $\hat{P}_m$ estimators are used as surrogates for the probability values. Additionally, a sub-case of Lemma 45 is needed in order to guarantee the label request budget will not be reached prematurely. Again, in many cases we prove a more general lemma than is required for its use in the proof of Theorem 6; these more general results will be needed in subsequent proofs: namely, in the proofs of Theorem 16 and Lemma 26.

We begin with a lemma concerning the ActiveSelect subroutine.

Lemma 34 For any $k^*, M, N \in \mathbb{N}$ with $k^* \le N$, and $N$ classifiers $\{h_1, h_2, \ldots, h_N\}$ (themselves possibly random variables, independent from $\{X_M, X_{M+1}, \ldots\}$), ActiveSelect($\{h_1, h_2, \ldots, h_N\}$, $m$, $\{X_M, X_{M+1}, \ldots\}$) makes at most $m$ label requests, and if $h_{\hat{k}}$ is the classifier it outputs, then with probability at least $1 - eN \cdot \exp\{-m/(72 k^* N \ln(eN))\}$, we have $\mathrm{er}(h_{\hat{k}}) \le 2\,\mathrm{er}(h_{k^*})$. $\diamond$



Proof This proof is essentially identical to a similar result of Balcan, Hanneke, and Vaughan (2010), but is included here for completeness.

Let $M_k = \left\lfloor \frac{m}{k(N-k)\ln(eN)} \right\rfloor$. First note that the total number of label requests in ActiveSelect is at most $m$, since summing up the sizes of the batches of label requests made in all executions of Step 2 yields

$$\sum_{j=1}^{N-1} \sum_{k=j+1}^{N} \left\lfloor \frac{m}{j(N-j)\ln(eN)} \right\rfloor \le \sum_{j=1}^{N-1} \frac{m}{j\ln(eN)} \le m.$$
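
The budget bound can be checked numerically. The sketch below assumes the batch size $\lfloor m/(j(N-j)\ln(eN)) \rfloor$ as read from the display above; the function name `total_requests` is hypothetical.

```python
import math

def total_requests(m, n_classifiers):
    """Total label requests over all pairs j < k, for budget parameter m.

    Each pair (j, k) with j < k is charged a batch of size
    floor(m / (j * (N - j) * ln(e * N))), and there are N - j such
    pairs for each j.
    """
    log_term = math.log(math.e * n_classifiers)
    total = 0
    for j in range(1, n_classifiers):
        batch = math.floor(m / (j * (n_classifiers - j) * log_term))
        total += (n_classifiers - j) * batch
    return total
```

The inner sum contributes at most $m/(j\ln(eN))$ for each $j$, and the harmonic sum over $j < N$ is at most $\ln(eN)$, which is why the total never exceeds $m$.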



Let $k^{**} = \mathrm{argmin}_{k \le k^*} \mathrm{er}(h_k)$. Now for any $j \in \{1, 2, \ldots, k^{**} - 1\}$ with $\mathcal{P}(x : h_j(x) \ne h_{k^{**}}(x)) > 0$, the law of large numbers implies that with probability one we will find at least $M_j$ examples remaining in the sequence for which $h_j(x) \ne h_{k^{**}}(x)$, and since $\mathrm{er}(h_{k^{**}} | \{x : h_j(x) \ne h_{k^{**}}(x)\}) \le 1/2$, Hoeffding's inequality implies that $\mathbb{P}(m_{k^{**}j} > 7/12) \le \exp\{-M_j/72\} \le \exp\{1 - m/(72 k^* N \ln(eN))\}$. A union bound implies

$$\mathbb{P}\left(\max_{j < k^{**}} m_{k^{**}j} > 7/12\right) \le k^{**} \cdot \exp\{1 - m/(72 k^* N \ln(eN))\}.$$

In particular, note that when $\max_{j < k^{**}} m_{k^{**}j} \le 7/12$, we must have $\hat{k} \ge k^{**}$.
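
The Hoeffding step can be sanity-checked against an exact binomial tail. This sketch treats the worst case allowed by the argument, a conditional error rate of exactly 1/2, and compares the probability that the empirical mistake fraction exceeds 7/12 with the bound $\exp\{-M/72\}$. The function name is hypothetical.

```python
import math

def tail_above(n_trials, p, frac_num, frac_den):
    """Exact P(Binomial(n_trials, p) / n_trials > frac_num / frac_den)."""
    total = 0.0
    for successes in range(n_trials + 1):
        # Integer comparison avoids rounding issues at the threshold.
        if successes * frac_den > frac_num * n_trials:
            total += (math.comb(n_trials, successes)
                      * p ** successes * (1.0 - p) ** (n_trials - successes))
    return total
```

Hoeffding's inequality with deviation $1/12$ gives $\exp\{-2M(1/12)^2\} = \exp\{-M/72\}$, which is the constant appearing in the proof.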

Now suppose $j \in \{k^{**} + 1, \ldots, N\}$ has $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$. In particular, this implies $\mathrm{er}(h_j | \{x : h_{k^{**}}(x) \ne h_j(x)\}) > 2/3$ and $\mathcal{P}(x : h_j(x) \ne h_{k^{**}}(x)) > 0$, which again means (with probability one) we will find at least $M_{k^{**}}$ examples in the sequence for which $h_j(x) \ne h_{k^{**}}(x)$. By Hoeffding's inequality, we have that

$$\mathbb{P}(m_{jk^{**}} \le 7/12) \le \exp\{-M_{k^{**}}/72\} \le \exp\{1 - m/(72 k^* N \ln(eN))\}.$$

By a union bound, we have that

$$\mathbb{P}\left(\exists j > k^{**} : \mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}}) \text{ and } m_{jk^{**}} \le 7/12\right) \le (N - k^{**}) \cdot \exp\{1 - m/(72 k^* N \ln(eN))\}.$$

In particular, when $\hat{k} \ge k^{**}$, and $m_{jk^{**}} > 7/12$ for all $j > k^{**}$ with $\mathrm{er}(h_j) > 2\,\mathrm{er}(h_{k^{**}})$, it must be true that $\mathrm{er}(h_{\hat{k}}) \le 2\,\mathrm{er}(h_{k^{**}}) \le 2\,\mathrm{er}(h_{k^*})$.

So, by a union bound, with probability $\ge 1 - eN \cdot \exp\{-m/(72 k^* N \ln(eN))\}$, the $\hat{k}$ chosen by ActiveSelect has $\mathrm{er}(h_{\hat{k}}) \le 2\,\mathrm{er}(h_{k^*})$. ■

The next two lemmas describe the limiting behavior of $\mathcal{S}^k(V^\star_m)$. In particular, we see that its limiting value is precisely $\partial^k_{\mathbb{C}} f$ (up to probability-zero differences). Lemma 35 establishes that $\mathcal{S}^k(V^\star_m)$ does not decrease below $\partial^k_{\mathbb{C}} f$ (except for a probability-zero set), and Lemma 36 establishes that its limit is not larger than $\partial^k_{\mathbb{C}} f$ (again, except for a probability-zero set).



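Lemma 35 There is an event $H'$ with $\mathbb{P}(H') = 1$ such that on $H'$, $\forall m \in \mathbb{N}$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, for any $\mathcal{H}$ with $V^\star_m \subseteq \mathcal{H} \subseteq \mathbb{C}$,

$$\mathcal{P}^k\left(\mathcal{S}^k(\mathcal{H}) \,\middle|\, \partial^k_{\mathbb{C}} f\right) = \mathcal{P}^k\left(\partial^k_{\mathcal{H}} f \,\middle|\, \partial^k_{\mathbb{C}} f\right) = 1,$$

and

$$\forall i \in \mathbb{N},\ 1_{\partial^k_{\mathcal{H}} f}\left(S_i^{(k+1)}\right) = 1_{\partial^k_{\mathbb{C}} f}\left(S_i^{(k+1)}\right).$$

Also, on $H'$, every such $\mathcal{H}$ has $\mathcal{P}^k(\partial^k_{\mathcal{H}} f) = \mathcal{P}^k(\partial^k_{\mathbb{C}} f)$, and $M_\ell^{(k+1)}(\mathcal{H}) \to \infty$ as $\ell \to \infty$. $\diamond$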



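Proof We will show the first claim for the set $V^\star_m$, and the result will then hold for $\mathcal{H}$ by monotonicity. In particular, we will show this for any fixed $k \in \{0, \ldots, \tilde{d}_f - 1\}$ and $m \in \mathbb{N}$, and the existence of $H'$ then holds by a union bound. Fix any set $S \in \partial^k_{\mathbb{C}} f$. Suppose $\mathrm{B}_{V^\star_m}(f,r)$ does not shatter $S$ for some $r > 0$. There is an infinite sequence of sets $\{\{h_1^{(i)}, h_2^{(i)}, \ldots, h_{2^k}^{(i)}\}\}_i$ with $\forall j \le 2^k$, $\mathcal{P}(x : h_j^{(i)}(x) \ne f(x)) \downarrow 0$, such that each $\{h_1^{(i)}, \ldots, h_{2^k}^{(i)}\} \subseteq \mathrm{B}(f,r)$ and shatters $S$. Since $\mathrm{B}_{V^\star_m}(f,r)$ does not shatter $S$,

$$1 = \inf_i 1\left[\exists j : h_j^{(i)} \notin \mathrm{B}_{V^\star_m}(f,r)\right] = \inf_i 1\left[\exists j : h_j^{(i)}(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\right].$$

But

$$\mathbb{P}\left(\inf_i 1\left[\exists j : h_j^{(i)}(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\right] = 1\right) \le \inf_i \mathbb{P}\left(\exists j : h_j^{(i)}(\mathcal{Z}_m) \ne f(\mathcal{Z}_m)\right)$$
$$\le \lim_{i \to \infty} \sum_{j \le 2^k} m \mathcal{P}\left(x : h_j^{(i)}(x) \ne f(x)\right) = \sum_{j \le 2^k} m \lim_{i \to \infty} \mathcal{P}\left(x : h_j^{(i)}(x) \ne f(x)\right) = 0,$$

where the second inequality follows from the union bound. Therefore, $\forall r > 0$, $\mathbb{P}\left(S \notin \mathcal{S}^k(\mathrm{B}_{V^\star_m}(f,r))\right) = 0$. Furthermore, since $1_{\bar{\mathcal{S}}^k(\mathrm{B}_{V^\star_m}(f,r))}(S)$ is monotonic in $r$, the dominated convergence theorem gives us that

$$\lim_{r \to 0} \mathbb{P}\left(S \notin \mathcal{S}^k(\mathrm{B}_{V^\star_m}(f,r))\right) = 0.$$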
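This implies that (letting $S \sim \mathcal{P}^k$ be independent from $V^\star_m$)

$$\mathbb{P}\left(\mathcal{P}^k\left(\bar{\partial}^k_{V^\star_m} f \,\middle|\, \partial^k_{\mathbb{C}} f\right) > 0\right) = \mathbb{P}\left(\mathcal{P}^k\left(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\right) > 0\right)$$
$$= \lim_{\xi \to 0} \mathbb{P}\left(\mathcal{P}^k\left(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\right) > \xi\right)$$
$$\le \lim_{\xi \to 0} \frac{1}{\xi} \mathbb{E}\left[\mathcal{P}^k\left(\bar{\partial}^k_{V^\star_m} f \cap \partial^k_{\mathbb{C}} f\right)\right] \qquad \text{(Markov)}$$
$$= \lim_{\xi \to 0} \frac{1}{\xi} \mathbb{E}\left[1_{\partial^k_{\mathbb{C}} f}(S)\, \mathbb{P}\left(S \notin \partial^k_{V^\star_m} f \,\middle|\, S\right)\right] \qquad \text{(Fubini)}$$
$$= \lim_{\xi \to 0} 0 = 0.$$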



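This establishes the first claim for $V^\star_m$, on an event of probability 1, and monotonicity extends the claim to any $\mathcal{H} \supseteq V^\star_m$. Also note that, on this event,

$$\mathcal{P}^k(\partial^k_{\mathcal{H}} f) \ge \mathcal{P}^k(\partial^k_{\mathcal{H}} f \cap \partial^k_{\mathbb{C}} f) = \mathcal{P}^k\left(\partial^k_{\mathcal{H}} f \,\middle|\, \partial^k_{\mathbb{C}} f\right) \mathcal{P}^k(\partial^k_{\mathbb{C}} f) = \mathcal{P}^k(\partial^k_{\mathbb{C}} f),$$

where the last equality follows from the first claim. Noting that for $\mathcal{H} \subseteq \mathbb{C}$, $\partial^k_{\mathcal{H}} f \subseteq \partial^k_{\mathbb{C}} f$, we must have

$$\mathcal{P}^k(\partial^k_{\mathcal{H}} f) = \mathcal{P}^k(\partial^k_{\mathbb{C}} f).$$

This establishes the third claim. From the first claim, for any given value of $i \in \mathbb{N}$ the second claim holds for $S_i^{(k+1)}$ (with $\mathcal{H} = V^\star_m$) on an additional event of probability 1; taking a union bound over all $i \in \mathbb{N}$ extends this claim to every $S_i^{(k+1)}$ on an event of probability 1. Monotonicity then implies

$$1_{\partial^k_{\mathbb{C}} f}\left(S_i^{(k+1)}\right) = 1_{\partial^k_{V^\star_m} f}\left(S_i^{(k+1)}\right) \le 1_{\partial^k_{\mathcal{H}} f}\left(S_i^{(k+1)}\right) \le 1_{\partial^k_{\mathbb{C}} f}\left(S_i^{(k+1)}\right),$$

extending the result to general $\mathcal{H}$. Also, as $k < \tilde{d}_f$, we know $\mathcal{P}^k(\partial^k_{\mathbb{C}} f) > 0$, and since we also know $V^\star_m$ is independent from $W_2$, the strong law of large numbers implies the final claim (for $V^\star_m$) on an additional event of probability 1; again, monotonicity extends this claim to any $\mathcal{H} \supseteq V^\star_m$. Intersecting the above events over values $m \in \mathbb{N}$ and $k < \tilde{d}_f$ gives the event $H'$, and as each of the above events has probability 1 and there are countably many such events, a union bound implies $\mathbb{P}(H') = 1$. ■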



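Note that one specific implication of Lemma 35, obtained by taking $k = 0$, is that on $H'$, $V^\star_m \ne \emptyset$ (even if $f \in \mathrm{cl}(\mathbb{C}) \setminus \mathbb{C}$). This is because, for $f \in \mathrm{cl}(\mathbb{C})$, we have $\partial^0_{\mathbb{C}} f = \mathcal{X}^0$, so that $\mathcal{P}^0(\partial^0_{\mathbb{C}} f) = 1$, which means $\mathcal{P}^0(\partial^0_{V^\star_m} f) = 1$ (on $H'$), so that we must have $\partial^0_{V^\star_m} f = \mathcal{X}^0$, which implies $V^\star_m \ne \emptyset$. In particular, this also means $f \in \mathrm{cl}(V^\star_m)$.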

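Lemma 36 There is a monotonic function $q(r) = o(1)$ (as $r \to 0$) such that, on event $H'$, for any $k \in \{0, \ldots, \tilde{d}_f - 1\}$, $m \in \mathbb{N}$, $r > 0$, and set $\mathcal{H}$ such that $V^\star_m \subseteq \mathcal{H} \subseteq \mathrm{B}(f,r)$,

$$\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) \le q(r).$$

In particular, for $\tau \in \mathbb{N}$ and $\delta > 0$, on $H_\tau(\delta) \cap H'$ (defined above), every $m \ge \tau$ and $k \in \{0, \ldots, \tilde{d}_f - 1\}$ has $\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(V^\star_m)\right) \le q(\phi(\tau;\delta))$. $\diamond$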
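Proof Fix any $k \in \{0, \ldots, \tilde{d}_f - 1\}$. By Lemma 35 we know that on event $H'$,

$$\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) = \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\right)}{\mathcal{P}^k\left(\mathcal{S}^k(\mathcal{H})\right)} \le \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\right)}{\mathcal{P}^k\left(\partial^k_{\mathcal{H}} f\right)}$$
$$= \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathcal{H})\right)}{\mathcal{P}^k\left(\partial^k_{\mathbb{C}} f\right)} \le \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathrm{B}(f,r))\right)}{\mathcal{P}^k\left(\partial^k_{\mathbb{C}} f\right)}.$$

Define $q_k(r)$ as this latter quantity. Since $\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \mathcal{S}^k(\mathrm{B}(f,r))\right)$ is monotonic in $r$,

$$\lim_{r \to 0} q_k(r) = \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \lim_{r \to 0} \mathcal{S}^k(\mathrm{B}(f,r))\right)}{\mathcal{P}^k\left(\partial^k_{\mathbb{C}} f\right)} = \frac{\mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \cap \partial^k_{\mathbb{C}} f\right)}{\mathcal{P}^k\left(\partial^k_{\mathbb{C}} f\right)} = 0.$$

This proves $q_k(r) = o(1)$. Defining

$$q(r) = \max\left\{q_k(r) : k \in \{0, 1, \ldots, \tilde{d}_f - 1\}\right\} = o(1)$$

completes the proof of the first claim.

For the final claim, simply recall that by Lemma 29, on $H_\tau(\delta)$, every $m \ge \tau$ has $V^\star_m \subseteq V^\star_\tau \subseteq \mathrm{B}(f, \phi(\tau;\delta))$. ■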



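Lemma 37 For $\zeta \in (0,1)$, define

$$r_\zeta = \sup\{r \in (0,1) : q(r) < \zeta\}/2.$$

On $H'$, $\forall k \in \{0, \ldots, \tilde{d}_f - 1\}$, $\forall \zeta \in (0,1)$, $\forall m \in \mathbb{N}$, for any set $\mathcal{H}$ such that $V^\star_m \subseteq \mathcal{H} \subseteq \mathrm{B}(f, r_\zeta)$,

$$\mathcal{P}\left(x : \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) > \zeta\right) = \mathcal{P}\left(x : \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \partial^k_{\mathcal{H}} f\right) > \zeta\right) = 0. \qquad (17)$$

In particular, for $\delta \in (0,1)$, defining $\tau(\zeta;\delta) = \min\left\{\tau \in \mathbb{N} : \sup_{m \ge \tau} \phi(m;\delta) \le r_\zeta\right\}$, for any $\tau \ge \tau(\zeta;\delta)$, and any $m \ge \tau$, on $H_\tau(\delta) \cap H'$, (17) holds for $\mathcal{H} = V^\star_m$. $\diamond$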

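Proof Fix $k, m, \mathcal{H}$ as described above, and suppose $q = \mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) < \zeta$; by Lemma 36, this happens on $H'$. Since $\partial^k_{\mathcal{H}} f \subseteq \mathcal{S}^k(\mathcal{H})$, we have that $\forall x \in \mathcal{X}$,

$$\mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) = \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \partial^k_{\mathcal{H}} f\right) \mathcal{P}^k\left(\partial^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right)$$
$$+ \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathcal{H}} f\right) \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right).$$

Since all probability values are bounded by 1, we have

$$\mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) \le \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \partial^k_{\mathcal{H}} f\right) + \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right). \qquad (18)$$

Isolating the right-most term in (18), by basic properties of probabilities we have

$$\mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) = \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathbb{C}} f\right) \mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) + \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\right) \mathcal{P}^k\left(\partial^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right)$$
$$\le \mathcal{P}^k\left(\bar{\partial}^k_{\mathbb{C}} f \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) + \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\right). \qquad (19)$$

By assumption, the left term in (19) equals $q$. Examining the right term in (19), we see that

$$\mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \mathcal{S}^k(\mathcal{H}) \cap \partial^k_{\mathbb{C}} f\right) = \mathcal{P}^k\left(\mathcal{S}^k(\mathcal{H}) \cap \bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \partial^k_{\mathbb{C}} f\right) \Big/ \mathcal{P}^k\left(\mathcal{S}^k(\mathcal{H}) \,\middle|\, \partial^k_{\mathbb{C}} f\right)$$
$$\le \mathcal{P}^k\left(\bar{\partial}^k_{\mathcal{H}} f \,\middle|\, \partial^k_{\mathbb{C}} f\right) \Big/ \mathcal{P}^k\left(\partial^k_{\mathcal{H}} f \,\middle|\, \partial^k_{\mathbb{C}} f\right). \qquad (20)$$

By Lemma 35, on $H'$ the denominator in (20) is 1 and the numerator is 0. Thus, combining this fact with (18) and (19), we have that on $H'$,

$$\mathcal{P}\left(x : \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \mathcal{S}^k(\mathcal{H})\right) > \zeta\right) \le \mathcal{P}\left(x : \mathcal{P}^k\left(\bar{\mathcal{S}}^k(\mathcal{H}[(x,f(x))]) \,\middle|\, \partial^k_{\mathcal{H}} f\right) > \zeta - q\right). \qquad (21)$$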

Note that proving the right side of (21) equals zero will suffice to establish the result, since it upper bounds both the first expression of (17) (as just established) and the second expression of (17) (by monotonicity of measures). Letting $X \sim \mathcal{P}$ be independent from the other random variables ($\mathcal{Z}$, $W_1$, $W_2$), by Markov's inequality, the right side of (21) is at most



1 



-E 



v^s^ {n[{xj{x))]) 



By 



n 



E 



{s'^ mix, fix))]) n By) 



n 



{Q-q)V^ {By) 



and by Fubini's theorem, this is (letting S ~ be independent from the other random variables) 



E 



ls.AS)V {x:S^S^ mx,f{x))])) 



n 



(c-(z)p^ [By) 



Lemma [35] implies this equals 



E 



lay{S)V (x:S^5^ {n[{x J {x))])) 



n 



iC -q)VHBy) 
For any fixed S € Bl^f, there is an infinite sequence of sets 

with Vi < 2>', V (x : hf {x) / f{x)) i 0, such that each j/i^, . . . , /ij^} 
^{[{x, fix))] does not shatter S, then 



(22) 



, h)^l !> C and shatters 5". If 



1 = inf 1 



3i:/if ^^[(x,/(x))] 



inf 1 



3j:hf{x)^fix) 
In particular,



v{x:SiS^ {n[{x, /(x))])) <V\^: inf 1 [Bj : hf {x) / f{x) 

= v{^[x: 3j : hf {x) / /(x)}^ < inf P (x : 3j s.t. /if (x) / /(a 

< lim y P fx : /iS'^(x) / /(x)) = y lim P fx : hf {x) / /(x)) = 0. 

Thus (22) is zero, which establishes the result.

The final claim is then implied by Lemma 29 and monotonicity of $V^\star_m$ in $m$: that is, on $H_\tau(\delta)$, $V^\star_m \subseteq V^\star_\tau \subseteq \mathrm{B}(f, \phi(\tau;\delta)) \subseteq \mathrm{B}(f, r_\zeta)$. ■



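Lemma 38 For any $\zeta \in (0,1)$, there are values $\left\{\Delta_n^{(\zeta)}(\varepsilon) : n \in \mathbb{N}, \varepsilon \in (0,1)\right\}$ such that, for any $n \in \mathbb{N}$ and $\varepsilon > 0$, on event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, letting $V = V^\star_{\lfloor n/3 \rfloor}$,

$$\mathcal{P}\left(x : \mathcal{P}^{\tilde{d}_f - 1}\left(S \in \mathcal{X}^{\tilde{d}_f - 1} : S \cup \{x\} \in \mathcal{S}^{\tilde{d}_f}(V) \,\middle|\, \mathcal{S}^{\tilde{d}_f - 1}(V)\right) \ge \zeta\right) \le \Delta_n^{(\zeta)}(\varepsilon),$$

and for any $\mathbb{N}$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon) = o(1)$. $\diamond$

Proof Throughout, we suppose the event $H_{\lfloor n/3 \rfloor}(\varepsilon/2) \cap H'$, and fix some $\zeta \in (0,1)$. We have $\forall x$,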

V^i~^ (S G X'^f-^ : 5U {x} G S'^f {V) S^f~^{V) 



-df-1, 



5?- 






























(23) 



By Lemma [35] the left term in (|23l ) equals 

P*^/^^ ( S G X^f~^ : 5U {x} G S^f{V) S^f-\V) D dt^~^ f ] V^f^^ ( S'^f~\V) 



= V^i'^ [S G : 5 U {x} G S'^i {V) 

and by Lemma l36l the right term in (l23l ) is at most [ra/3j ; e/2)). Thus, we have 

P fx : V^f-^ (S G X^f-^ : 5 U {x} G S^f {V) S^i-^{V)] > C 



<V[x: V^f-^ ( S G X^'f-' : 5U{x} G S^^f (V) 



df-l 



dT'f > c 



n/3\;e/2))] . (24) 
For $n < 3\tau(\zeta/2; \varepsilon/2)$ (for $\tau(\cdot;\cdot)$ defined in Lemma 37), we define $\Delta_n^{(\zeta)}(\varepsilon) = 1$. Otherwise, suppose $n \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $q(\phi(\lfloor n/3 \rfloor; \varepsilon/2)) < \zeta/2$, and thus (24) is at most

r{x: V"!^^ ( S e X^f~^ : 5 U {x} G S^i {V) dt^"^ f ) > C/2 ) • 



By Lemma [29] this is at most 



V[x: Vi-' S £ X"!-^ : SU {x} G S^f (B(/, <^([n/3j ; e/2))) 



5? 7 ) > C/2 



Letting X ~ by Markov's inequality this is at most 
2 



-E 



V^i'^ [ S £ X'^s-^ :5U{X} G5'^/(B(/,(/)([n/3j;e/2))) 
2 



c 



< p^/ 5^/ (B(/,0(Ln/3j;e/2))) . 



^P'^/ (5u{x} G A"^/ :5u{x} G 5°^/ (B(/,(/.([n/3j;e/2))) and 5 G V 
C(5/ 



(25) 



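Thus, defining $\Delta_n^{(\zeta)}(\varepsilon)$ as (25) for $n \ge 3\tau(\zeta/2; \varepsilon/2)$ establishes the first claim.

It remains only to prove the second claim. Let $N(\varepsilon) = \omega(\log(1/\varepsilon))$. Since $\tau(\zeta/2; \varepsilon/2) = O(\log(1/\varepsilon))$, we have that for all sufficiently small $\varepsilon > 0$, $N(\varepsilon) \ge 3\tau(\zeta/2; \varepsilon/2)$, so that $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon)$ equals (25) (with $n = N(\varepsilon)$). Furthermore, since $\tilde{\delta}_f > 0$, $\mathcal{P}^{\tilde{d}_f}\left(\partial^{\tilde{d}_f}_{\mathbb{C}} f\right) = 0$, and $\phi(\lfloor N(\varepsilon)/3 \rfloor; \varepsilon/2) = o(1)$, by continuity of probability measures we know (25) is $o(1)$ when $n = N(\varepsilon)$, so that we generally have $\Delta_{N(\varepsilon)}^{(\zeta)}(\varepsilon) = o(1)$.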



For any $m \in \mathbb{N}$, define

$$\tilde{M}(m) = m^3 \tilde{\delta}_f / 2.$$



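Lemma 39 There is a $(\mathbb{C}, \mathcal{P}, f)$-dependent constant $c^{(i)} \in (0,\infty)$ such that, for any $\tau \in \mathbb{N}$ there is an event $H_\tau^{(i)} \subseteq H'$ with

$$\mathbb{P}\left(H_\tau^{(i)}\right) \ge 1 - c^{(i)} \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}$$

such that on $H_\tau^{(i)}$, if $\tilde{d}_f \ge 2$, then $\forall k \in \{2, \ldots, \tilde{d}_f\}$, $\forall m \ge \tau$, $\forall \ell \in \mathbb{N}$, for any set $\mathcal{H}$ such that $V^\star_\ell \subseteq \mathcal{H} \subseteq \mathbb{C}$,

$$M_m^{(k)}(\mathcal{H}) \ge \tilde{M}(m). \qquad \diamond$$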
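Proof On $H'$, Lemma 35 implies every $1_{\mathcal{S}^{k-1}(\mathcal{H})}\left(S_i^{(k)}\right) \ge 1_{\partial^{k-1}_{\mathbb{C}} f}\left(S_i^{(k)}\right)$, so we focus on showing $\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| \ge \tilde{M}(m)$ on an appropriate event. We know

$$\mathbb{P}\left(\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau, \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| \ge \tilde{M}(m)\right)$$
$$= 1 - \mathbb{P}\left(\exists k \in \{2, \ldots, \tilde{d}_f\}, m \ge \tau : \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| < \tilde{M}(m)\right)$$
$$\ge 1 - \sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \mathbb{P}\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| < \tilde{M}(m)\right),$$

where the last line follows by a union bound. Thus, we will focus on bounding

$$\sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \mathbb{P}\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| < \tilde{M}(m)\right). \qquad (26)$$

Fix any $k \in \{2, \ldots, \tilde{d}_f\}$, and integer $m \ge \tau$. Since

$$\mathbb{E}\left[\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right|\right] = \mathcal{P}^{k-1}\left(\partial^{k-1}_{\mathbb{C}} f\right) m^3 \ge \tilde{\delta}_f m^3,$$

a Chernoff bound implies that

$$\mathbb{P}\left(\left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| < \tilde{M}(m)\right) \le \exp\left\{-m^3 \mathcal{P}^{k-1}\left(\partial^{k-1}_{\mathbb{C}} f\right)/8\right\} \le \exp\left\{-m^3 \tilde{\delta}_f/8\right\}.$$

Thus, we have that (26) is at most

$$\sum_{m \ge \tau} \sum_{k=2}^{\tilde{d}_f} \exp\left\{-m^3 \tilde{\delta}_f/8\right\} \le \sum_{m \ge \tau} \tilde{d}_f \cdot \exp\left\{-m^3 \tilde{\delta}_f/8\right\} \le \sum_{m \ge \tau^3} \tilde{d}_f \cdot \exp\left\{-m \tilde{\delta}_f/8\right\}$$
$$\le \tilde{d}_f \cdot \exp\left\{-\tilde{M}(\tau)/4\right\} + \tilde{d}_f \int_{\tau^3}^{\infty} \exp\left\{-x \tilde{\delta}_f/8\right\} dx$$
$$= \tilde{d}_f \cdot \left(1 + 8/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}$$
$$\le \left(9 \tilde{d}_f/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/4\right\}.$$

Note that since $\mathbb{P}(H') = 1$, defining

$$H_\tau^{(i)} = \left\{\forall k \in \{2, \ldots, \tilde{d}_f\}, \forall m \ge \tau, \left|\left\{S_i^{(k)} : i \le m^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right| \ge \tilde{M}(m)\right\} \cap H'$$

has the required properties.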
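Lemma 40 For any $\tau \in \mathbb{N}$, there is an event $G_\tau^{(i)}$ with

$$\mathbb{P}\left(H_\tau^{(i)} \setminus G_\tau^{(i)}\right) \le \left(121 \tilde{d}_f/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}$$

such that, on $G_\tau^{(i)}$, if $\tilde{d}_f \ge 2$, then for every integer $s \ge \tau$ and $k \in \{2, \ldots, \tilde{d}_f\}$, $\forall r \in (0, r_{1/6}]$,

$$M_s^{(k)}(\mathrm{B}(f,r)) \le (3/2) \left|\left\{S_i^{(k)} : i \le s^3\right\} \cap \partial^{k-1}_{\mathbb{C}} f\right|. \qquad \diamond$$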


Proof Fix integers s > t and k G |2, . . . 

{^f^ i < n cS'=-i(B(/,r)). Note 



and let r = ri/g. Define the set 5^ ^ = 
A'/i''^ (B (/,r)) and the elements of S''^^ 



are conditionally i.i.d. given Ms (B (/, r)), each with conditional distribution equivalent to the 



conditional S 



(k) 



l^(fc) e 5^-1(6 (/,r))}. In particular, E \S''-^ D d^'^ f\ A^f^(B(/,r)) 



cS'^-i (B (/, r)) ) Afi''^ (B (/, r)). Define the event 



< (3/2) 



} 



By Lemma [36l (indeed by definition of q{r) and r^/g) we have 
1-p(g«(A;,s)|a//W (B(/,r))^ 

= P n VI < (2/3)Mf ) (B (/, r)) |Mf) (B (/, r))] 

< P (icS^-i n VI < (4/5) (1 - q (r)) Af « (B (/, r)) | AfW (B (/, r))^ 

<p(|cS^-in9^-V| < (4/5)^'=-! (9^-V|5^"-VB(/,r))) Af«(B(/,r))|Mf)(B(/,r))). 

(27) 

By a Chemoff bound, (l27l ) is at most 



exp{-A^W (B(/,r))P^-i (5^-V|5^"' (B(/,r))) /5o} 

< exp {-A^i'^) (B (/,r)) (1 - q (r)) /5o} < exp {-M^'^^ (B (/,r)) /6o} . 

Thus, by Lemma [39l 



P \ G^\k,s)j < P (|A^f ) (B(/,r)) > A^(s) 
= E [(l -p(G«(A:,.)|Af« (B(/,r)))) (Aff) (B(/,r)); 

< E [exp {-A//W (B(/,r))/60} 1[m(.),oo) (^i'^ (B(/,r)))] < exp {-A^(s)/6o} 
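Now defining $G_\tau^{(i)} = \bigcap_{s \ge \tau} \bigcap_{k=2}^{\tilde{d}_f} G_\tau^{(i)}(k,s)$, a union bound implies

$$\mathbb{P}\left(H_\tau^{(i)} \setminus G_\tau^{(i)}\right) \le \sum_{s \ge \tau} \tilde{d}_f \cdot \exp\left\{-\tilde{M}(s)/60\right\}$$
$$\le \tilde{d}_f \left(\exp\left\{-\tilde{M}(\tau)/60\right\} + \int_{\tau^3}^{\infty} \exp\left\{-x \tilde{\delta}_f/120\right\} dx\right)$$
$$= \tilde{d}_f \left(1 + 120/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}$$
$$\le \left(121 \tilde{d}_f/\tilde{\delta}_f\right) \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}.$$

This completes the proof for $r = r_{1/6}$. Monotonicity extends the result to any $r \in (0, r_{1/6}]$. ■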

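Lemma 41 There exist $(\mathbb{C}, \mathcal{P}, f, \gamma)$-dependent constants $\tau^* \in \mathbb{N}$ and $c^{(ii)} \in (0,\infty)$ such that, for any integer $\tau \ge \tau^*$, there is an event $H_\tau^{(ii)} \subseteq G_\tau^{(i)}$ with

$$\mathbb{P}\left(H_\tau^{(i)} \setminus H_\tau^{(ii)}\right) \le c^{(ii)} \cdot \exp\left\{-\tilde{M}(\tau)^{1/3}/60\right\} \qquad (28)$$

such that, on $H_\tau^{(i)} \cap H_\tau^{(ii)}$, $\forall s, m, \ell, k \in \mathbb{N}$ with $\ell < m$ and $k \le \tilde{d}_f$, for any set of classifiers $\mathcal{H}$ with $V^\star_\ell \subseteq \mathcal{H}$, if either $k = 1$, or $s \ge \tau$ and $\mathcal{H} \subseteq \mathrm{B}(f, r_{(1-\gamma)/6})$, then

$$\hat{\Delta}_s^{(k)}(X_m, W_2, \mathcal{H}) < \gamma \implies \hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, \mathcal{H}) < \hat{\Gamma}_s^{(k)}(X_m, f(X_m), W_2, \mathcal{H}).$$

In particular, for $\delta \in (0,1)$ and $\tau \ge \max\{\tau((1-\gamma)/6; \delta), \tau^*\}$, on $H_\tau(\delta) \cap H_\tau^{(i)} \cap H_\tau^{(ii)}$, this is true for $\mathcal{H} = V^\star_\ell$ for every $k, \ell, m, s \in \mathbb{N}$ satisfying $\tau \le \ell < m$, $\tau \le s$, and $k \le \tilde{d}_f$. $\diamond$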



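Proof Let $\tau^* = (6/(1-\gamma)) \cdot \left(2/\tilde{\delta}_f\right)^{1/3}$, and consider any $\tau, k, \ell, m, s, \mathcal{H}$ as described above. If $k = 1$, the result clearly holds. In particular, Lemma 35 implies that on $H_\tau^{(i)}$, $\mathcal{H}[(X_m, f(X_m))] \supseteq V^\star_m \ne \emptyset$, so that some $h \in \mathcal{H}$ has $h(X_m) = f(X_m)$, and therefore

$$\hat{\Gamma}_s^{(1)}(X_m, -f(X_m), W_2, \mathcal{H}) = 1_{\bigcap_{h \in \mathcal{H}} \{h(X_m)\}}(-f(X_m)) = 0,$$

and since $\hat{\Delta}_s^{(1)}(X_m, W_2, \mathcal{H}) = 1_{\mathrm{DIS}(\mathcal{H})}(X_m)$, if $\hat{\Delta}_s^{(1)}(X_m, W_2, \mathcal{H}) < \gamma$, then since $\gamma < 1$ we have $X_m \notin \mathrm{DIS}(\mathcal{H})$, so that

$$\hat{\Gamma}_s^{(1)}(X_m, f(X_m), W_2, \mathcal{H}) = 1_{\bigcap_{h \in \mathcal{H}} \{h(X_m)\}}(f(X_m)) = 1.$$

Otherwise, suppose $2 \le k \le \tilde{d}_f$. Note that on $H_\tau^{(i)} \cap G_\tau^{(i)}$, $\forall m \in \mathbb{N}$, and any $\mathcal{H}$ with $V^\star_\ell \subseteq \mathcal{H} \subseteq \mathrm{B}(f, r_{(1-\gamma)/6})$ for some $\ell \in \mathbb{N}$,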

1 



< 



< 





i < 




V| 




1 








i < S^^ 








1 







i=l 
,3 



-s^-Mv^) i^i'^j l5^-i(B(/,r-(,_,,/«)) [Sr) (monotonicity) 



(fc) 



=3 



< 



3 

2Afi'=)(B(/,r(i_,)/6)) 



(B(/,r(i-^)/6)) l.'^i 



(fc) 



(fc) 



(monotonicity) 
(Lemma [351) 
(Lemma [40l) 



2=1 



For brevity, let f denote this last quantity, and let M^s = Mg''^ (B (/, r(i_^)/g)) . By Hoeffding's 
inequality, we have 



(2/3)f > V'^-' fe^V S>^-' (B (/,r(i_,)/6))) + 



-1/3 



Mks < exp -2M; 



rl/3 



} 



Thus, by Lemmas [36l[3l and [40l 

^{(2/3)f W (X^, -fiX^), W2,n) > q (r(i_^)/6) + Af (sj^Vsj p ^« p g« 



= E 
< E 



(2/3)f > P^-i (a^^V 5'=-^ (B (/ 



S'^' (B (/, r(i_,)/6))) + M(.)-i/3} n 
S'^-' (B (/, r(i_^)/6))) + M^y'] n > Af(,)} 

1/3 



,^(l-7)/6 



(Mks)] <exp{-2M(^ 



Mks 1 



[A/(s),oo) 



(Mk 



Thus, there is an event H^''\k, s) with P (^H^'^ n ci''^ \ //^**^(A;, s)) < exp |-2A/(s)V3| such 
that 

ff) (X^,-/(X^),W^2,^) < (3/2) (r(i„^)/6) +M(s)-i/3^ 
holds for these particular values of k and s. 
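To extend to the full range of values, we simply take $H_\tau^{(ii)} = G_\tau^{(i)} \cap \bigcap_{s \ge \tau} \bigcap_{k \le \tilde{d}_f} H_\tau^{(ii)}(k,s)$. Since $\tau \ge \left(2/\tilde{\delta}_f\right)^{1/3}$, we have $\tilde{M}(\tau) \ge 1$, so a union bound implies

$$\mathbb{P}\left(H_\tau^{(i)} \cap G_\tau^{(i)} \setminus H_\tau^{(ii)}\right) \le \sum_{s \ge \tau} \tilde{d}_f \cdot \exp\left\{-2\tilde{M}(s)^{1/3}\right\}$$
$$\le \tilde{d}_f \cdot \left(\exp\left\{-2\tilde{M}(\tau)^{1/3}\right\} + \int_{\tau}^{\infty} \exp\left\{-2\tilde{M}(x)^{1/3}\right\} dx\right)$$
$$= \tilde{d}_f \left(1 + 2^{-2/3} \tilde{\delta}_f^{-1/3}\right) \cdot \exp\left\{-2\tilde{M}(\tau)^{1/3}\right\} \le 2 \tilde{d}_f \tilde{\delta}_f^{-1/3} \cdot \exp\left\{-2\tilde{M}(\tau)^{1/3}\right\}.$$

Then Lemma 40 and a union bound imply

$$\mathbb{P}\left(H_\tau^{(i)} \setminus H_\tau^{(ii)}\right) \le 2 \tilde{d}_f \tilde{\delta}_f^{-1/3} \cdot \exp\left\{-2\tilde{M}(\tau)^{1/3}\right\} + 121 \tilde{d}_f \tilde{\delta}_f^{-1} \cdot \exp\left\{-\tilde{M}(\tau)/60\right\}$$
$$\le 123 \tilde{d}_f \tilde{\delta}_f^{-1} \cdot \exp\left\{-\tilde{M}(\tau)^{1/3}/60\right\}.$$

On $H_\tau^{(i)} \cap H_\tau^{(ii)}$, every such $s, m, \ell, k$ and $\mathcal{H}$ satisfy

$$\hat{\Gamma}_s^{(k)}(X_m, -f(X_m), W_2, \mathcal{H}) < (3/2)\left(q(r_{(1-\gamma)/6}) + \tilde{M}(s)^{-1/3}\right)$$
$$< (3/2)\left((1-\gamma)/6 + (1-\gamma)/6\right) = (1-\gamma)/2, \qquad (29)$$

where the second inequality follows by definition of $r_{(1-\gamma)/6}$ and $s \ge \tau \ge \tau^*$.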
If Ai*^^ (X^,W2,?^) <7,then 

1 

Finally, noting that we always have 

we have that, on the event H^-'^ n Hi''\ if Ai''^ {Xm, W2,n) then 

(X^,-/(X^),M^2,^) 

< (l-7)/2 = -(l-7)/2 + (l-7) by® 

< -(1 - 7)/2 + E (^f ^sHm (^f ^ U {X^}) by m 

Ms [H) j=i 

--''-^''''+x?jb;T^>^>-(«)(s;")''5'- 

Ms [H) j=i 



+ , .(fc) E l-s'^-Mw) (-^f^) '^s''-Hni{x,^,-f{x^m (-^f^ 

= -(1 - 7)/2 + fi'^) (X„, -/(X^), 7^) + f {Xm, f{Xm),W2,n) 

<ri''^x^j{x^),w2,n). by dag 
The final claim in the lemma statement is then implied by Lemma 29, since $V^\star_\ell \subseteq V^\star_\tau \subseteq \mathrm{B}(f, \phi(\tau;\delta)) \subseteq \mathrm{B}(f, r_{(1-\gamma)/6})$ on $H_\tau(\delta)$. ■



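For any $k, \ell, m \in \mathbb{N}$, and any $x \in \mathcal{X}$, define

$$\hat{p}_x(k,\ell,m) = \hat{\Delta}_m^{(k)}(x, W_2, V^\star_\ell)$$

$$p_x(k,\ell) = \mathcal{P}^{k-1}\left(S \in \mathcal{X}^{k-1} : S \cup \{x\} \in \mathcal{S}^k(V^\star_\ell) \,\middle|\, \mathcal{S}^{k-1}(V^\star_\ell)\right).$$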



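Lemma 42 For any $\zeta \in (0,1)$, there is a $(\mathbb{C}, \mathcal{P}, f, \zeta)$-dependent constant $c^{(iii)}(\zeta) \in (0,\infty)$ such that, for any $\tau \in \mathbb{N}$, there is an event $H_\tau^{(iii)}(\zeta)$ with

$$\mathbb{P}\left(H_\tau^{(i)} \setminus H_\tau^{(iii)}(\zeta)\right) \le c^{(iii)}(\zeta) \cdot \exp\left\{-\zeta^2 \tilde{M}(\tau)\right\}$$

such that on $H_\tau^{(i)} \cap H_\tau^{(iii)}(\zeta)$, $\forall k, \ell, m \in \mathbb{N}$ with $\tau \le \ell \le m$ and $k \le \tilde{d}_f$, for any $x \in \mathcal{X}$,

$$\mathcal{P}\left(x : |p_x(k,\ell) - \hat{p}_x(k,\ell,m)| > \zeta\right) \le \exp\left\{-\zeta^2 \tilde{M}(m)\right\}. \qquad \diamond$$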

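Proof Fix any $k, \ell, m \in \mathbb{N}$ with $\tau \le \ell \le m$ and $k \le \tilde{d}_f$. Recall our convention that $\mathcal{X}^0 = \{\emptyset\}$ and $\mathcal{P}^0(\mathcal{X}^0) = 1$; thus, if $k = 1$, $\hat{p}_x(k,\ell,m) = 1_{\mathrm{DIS}(V^\star_\ell)}(x) = 1_{\mathcal{S}^1(V^\star_\ell)}(x) = p_x(k,\ell)$, so the result clearly holds for $k = 1$.

For the remaining case, suppose $2 \le k \le \tilde{d}_f$. To simplify notation, let $\tilde{m} = M_m^{(k)}(V^\star_\ell)$, $X = X_{\ell+1}$, $p_x = p_x(k,\ell)$ and $\hat{p}_x = \hat{p}_x(k,\ell,m)$. Consider the event

$$H^{(iii)}(k,\ell,m,\zeta) = \left\{\mathcal{P}\left(x : |p_x - \hat{p}_x| > \zeta\right) \le \exp\left\{-\zeta^2 \tilde{M}(m)\right\}\right\}.$$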



We have 



H^)\H^ii'\k,£,m,C) V; 
< P ( <! m > M(m) I \ (A;, i, m, () 



(31) 



VA (by Lemma [391) 



(^|m > M(m)| n |l 



^sm\px-px\ y ^sfhC, 



W2, VA > e-^'*^^'") I I VA , (32) 



for any value s > 0. Proceeding as in Chernoff 's bounding technique, by Markov's inequality 
is at most 



P (^|m > M(m)} n |e-''^^E 
< P (||m > M(m)| 



„sm\px-Px\ 



W2,V; 



> e 



n <^ e 



„sm{px-px) _j_ gSrh{px-px) 



E 



4M(m),oo) im)FU 



^sm{px-Px) _|_ gSrh{px-px) 



W2,Vl 
W2.V0 



}V') 



> e 



rh,V^ 
By Markov's inequality, this is at most



E 



11 - f™W^^*(™)F 



E 



E 



[M(m), oo) 



(m) e^'^^^'^^e-^'^^E 



^sm{px-px) _j_ gSm(px-pjf) 



gSm(px-Px) _j_ Qsrh{px-Px) 



v; 



J^[M(m),oo) '^"^i ^ ^ ^ 



E 



^srh{px-px) _|_ Qsfh{px-Px) 



X, m, W 



(33) 



The conditional distribution of $\tilde m \hat p_X$ given $(X, \tilde m, V^\star_\ell)$ is $\mathrm{Binomial}(\tilde m, p_X)$, so letting $B_1(p_X), B_2(p_X), \ldots$ denote a sequence of random variables, conditionally independent with distribution $\mathrm{Bernoulli}(p_X)$ given $(X, \tilde m, V^\star_\ell)$, we have



E 



^srh{px-px) _j_ gSm{px-px) 



£ IgSr'nipx-px) 



X, m, VI 



X, m, Vt 
+ E 



E 



n 

.4 = 1 



,s(Px-Bj(px)) 



X, m, 
X m, W 



+ E 



j-j-gS(B,(px)-Px) 



fi=i 

+ E ^e«(Bi(px)-Px) 



X, m, VI 
X, in, Vo 



(34) 



It is known that for $B \sim \mathrm{Bernoulli}(p)$, $\mathbb{E}\left[e^{s(B-p)}\right]$ and $\mathbb{E}\left[e^{s(p-B)}\right]$ are at most $e^{s^2/8}$ (see e.g., Lemma 8.1 of Devroye, Györfi, and Lugosi, 1996). Thus, taking $s = 4\zeta$, (34) is at most $2e^{2\tilde m \zeta^2}$, and (33) is at most



E 



1 



[M(m), oo) 



(m) 2e 



f2M(m)g-4mC2g2mC^ 



E 



< 2exp 



-C2M(m)| . 
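The moment generating function bound used in this step (Hoeffding's lemma for a centered Bernoulli variable) can be checked numerically. The following Python sketch is only an illustration: the function names are hypothetical and not part of the paper. It evaluates $\mathbb{E}[e^{\pm s(B-p)}]$ in closed form on a grid of $p$ values and compares it with $e^{s^2/8}$. It also confirms the arithmetic that $s = 4\zeta$ turns a product of $\tilde m$ such factors into $e^{2\tilde m\zeta^2}$.

```python
import math


def bernoulli_mgf(p: float, s: float) -> float:
    """Closed form of E[exp(s * (B - p))] for B ~ Bernoulli(p)."""
    return p * math.exp(s * (1.0 - p)) + (1.0 - p) * math.exp(-s * p)


def hoeffding_cap(s: float) -> float:
    """Upper bound exp(s^2 / 8) from Hoeffding's lemma."""
    return math.exp(s * s / 8.0)


def worst_ratio(s_values, grid: int = 200) -> float:
    """Largest value of MGF / cap over a grid of p and both signs of s."""
    worst = 0.0
    for s in s_values:
        for i in range(grid + 1):
            p = i / grid
            for sign in (1.0, -1.0):
                ratio = bernoulli_mgf(p, sign * s) / hoeffding_cap(s)
                worst = max(worst, ratio)
    return worst


def log_product_bound(m_tilde: int, zeta: float) -> float:
    """Log of the bound on a product of m_tilde factors when s = 4 * zeta."""
    s = 4.0 * zeta
    return m_tilde * s * s / 8.0  # equals 2 * m_tilde * zeta^2
```

A ratio of at most 1 on the whole grid is consistent with the cited lemma; it is a spot check, not a proof.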



Since this bound holds for (31), the law of total probability implies



E 



v; 



< 2 • exp 



|-C^M(m)| 
Defining $H_\tau^{(iii)}(\zeta) = \bigcap_{\ell \ge \tau} \bigcap_{m \ge \ell} \bigcap_{k=2}^{\tilde d_f} H^{(iii)}(k,\ell,m,\zeta)$, we have the required property for the claimed ranges of $k$, $\ell$ and $m$, and a union bound implies



<2dfJ2 (exp |-C^M(£)} + exp |-xC^<5//2} dx 
= 2df ■Y.[l + 2C%') ■ exp {-C'MW} 

< 2df ■ (l + 2C"^^7^) • (exp |-C^M(t)} + j exp i^-xC'^6f/2^ dx^ 

= 2df ■ (l + 2C'^6j^y ■ exp |-C^M(t)} 

< ISd/C"^^^^ • exp |-C^M(r)} . 

■ 

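For $k, \ell, m \in \mathbb{N}$ and $\zeta \in (0,1)$, define

$$\hat p_\zeta(k,\ell,m) = \mathcal{P}\left(x : \hat p_x(k,\ell,m) \ge \zeta\right). \qquad (35)$$

Lemma 43 For any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \left(0, 1 - \sqrt{\alpha}\right]$, and integer $\tau \ge \tau(\beta;\delta)$, on $H_\tau(\delta) \cap H_\tau^{(i)} \cap H_\tau^{(iii)}(\beta\zeta)$, for any $k, \ell, \ell', m \in \mathbb{N}$ with $\tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$,

$$\hat p_\zeta(k,\ell',m) \le \mathcal{P}\left(x : p_x(k,\ell) \ge \alpha\zeta\right) + \exp\left\{-\beta^2\zeta^2 \tilde M(m)\right\}. \qquad (36)$$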



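Proof Fix any $\alpha, \zeta, \delta \in (0,1)$, $\beta \in \left(0, 1 - \sqrt{\alpha}\right]$, $\tau, k, \ell, \ell', m \in \mathbb{N}$ with $\tau(\beta;\delta) \le \tau \le \ell \le \ell' \le m$ and $k \le \tilde d_f$.

If $k = 1$, the result clearly holds. In particular, we have

$$\hat p_\zeta(1,\ell',m) = \mathcal{P}\left(\mathrm{DIS}\left(V^\star_{\ell'}\right)\right) \le \mathcal{P}\left(\mathrm{DIS}\left(V^\star_{\ell}\right)\right) = \mathcal{P}\left(x : p_x(1,\ell) \ge \alpha\zeta\right).$$

Otherwise, suppose $2 \le k \le \tilde d_f$. By a union bound,

$$\hat p_\zeta(k,\ell',m) = \mathcal{P}\left(x : \hat p_x(k,\ell',m) \ge \zeta\right)$$

$$\le \mathcal{P}\left(x : p_x(k,\ell') \ge \sqrt{\alpha}\,\zeta\right) + \mathcal{P}\left(x : \left|p_x(k,\ell') - \hat p_x(k,\ell',m)\right| > \left(1 - \sqrt{\alpha}\right)\zeta\right). \qquad (37)$$

Since

$$\mathcal{P}\left(x : \left|p_x(k,\ell') - \hat p_x(k,\ell',m)\right| > \left(1 - \sqrt{\alpha}\right)\zeta\right) \le \mathcal{P}\left(x : \left|p_x(k,\ell') - \hat p_x(k,\ell',m)\right| > \beta\zeta\right),$$

Lemma 42 implies that, on $H_\tau^{(i)} \cap H_\tau^{(iii)}(\beta\zeta)$,

$$\mathcal{P}\left(x : \left|p_x(k,\ell') - \hat p_x(k,\ell',m)\right| > \left(1 - \sqrt{\alpha}\right)\zeta\right) \le \exp\left\{-\beta^2\zeta^2 \tilde M(m)\right\}. \qquad (38)$$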
It remains only to examine the first term on the right side of (37). For this, if $\mathcal{P}^{k-1}\left(\mathcal{S}^{k-1}\left(V^\star_{\ell'}\right)\right) = 0$, then the first term is $0$ by our aforementioned convention, and thus (36) holds; otherwise, since



we have 



V [x : p^{k, £') > V^C) =r(x: V''"^ (^S G X^^^ : S U {x} e S'' (F^t) S''"^ {V^1)j > V^Cj 
= r(x: V^^^ [S G X^^^ : 5 U {x} G cS'^ (F^^t)) > V^C^^'"^ (s^^^ (^/))) • (39) 

By Lemma[35]and monotonicity, on Hr^ C H' , (l39l ) is at most 

V (x : v^~^ {s G x^-^ ■.su{x]es^ (y^t)) > ^CV^'^ {dt^f] 

and monotonicity implies this is at most 

V (x : P'^-I (5 G X^~^ : 5 U {x} G 5^= (F/)) > V^C^'^^^ (^c" V) ) • (40) 

By Lemma[36l for r > r(/3; 5), on n 
which implies 

■yk—l I o/c— 1 r\ \ -nfe— 1 / ofc— 1 f cfc— 1 /T^* 



^ [9c~ f) ^ ^ (^c / n ^ i^i 
Altogether, for r > r(/3; 5), on iJrl'^) n Hr \ is at most 

v(x:V''^^(^SeX''~^ :SU{x}€S^ (F/)) > aC^^^^ (5^'"^^*))) = V {x : Pc,{kJ) > aC), 
which, combined with (37) and (38), establishes (36). ■



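Lemma 44 There are events $\left\{H_\tau^{(iv)} : \tau \in \mathbb{N}\right\}$ with

$$\mathbb{P}\left(H_\tau^{(iv)}\right) \ge 1 - 3\tilde d_f \cdot \exp\{-2\tau\}$$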

such that, for any i G (0,7/16], 5 G (0,1), and integer t > r(™)(^;5), w/7ere r^^^^C; 5) = 

max |t(4C/7; 5), In (^))'^'|, o« i/.(5) n ifj*Mi^p)(0 n /^(^UA; G {1, . . . , d/[ 

V£ G N w/f/j £ > T, 

V{x:p^{k,e) > 7/2) +exp{-72M(£)/256} < Af^ {WuW2,Vl) (41) 

< P(x :p^(A;,£) > 7/8) + 4r^ (42) 
Proof For any $k, \ell \in \mathbb{N}$, by Hoeffding's inequality and the law of total probability, on an event $G^{(iv)}(k,\ell)$ with $\mathbb{P}\left(G^{(iv)}(k,\ell)\right) \ge 1 - 2\exp\{-2\ell\}$, we have



£3 



i=l 

(iv) _ ^ f-^df 



<i-^. (43) 



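Define the event $H_\tau^{(iv)} = \bigcap_{\ell \ge \tau} \bigcap_{k=1}^{\tilde d_f} G^{(iv)}(k,\ell)$. By a union bound, we have

$$1 - \mathbb{P}\left(H_\tau^{(iv)}\right) \le 2\tilde d_f \cdot \sum_{\ell \ge \tau} \exp\{-2\ell\}$$

$$\le 2\tilde d_f \cdot \left(\exp\{-2\tau\} + \int_\tau^\infty \exp\{-2x\}\,\mathrm{d}x\right) = 3\tilde d_f \cdot \exp\{-2\tau\}.$$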
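The sum-to-integral step in this union bound is easy to verify numerically. The sketch below is an illustrative check with hypothetical helper names. It compares a long partial sum of $\sum_{\ell \ge \tau} e^{-2\ell}$ against the closed form $e^{-2\tau} + \int_\tau^\infty e^{-2x}\,\mathrm{d}x = \tfrac{3}{2}e^{-2\tau}$; multiplying both sides by $2\tilde d_f$ gives the stated $3\tilde d_f e^{-2\tau}$.

```python
import math


def tail_sum(tau: int, terms: int = 200) -> float:
    """Partial sum of exp(-2*l) for l = tau, ..., tau + terms - 1."""
    return sum(math.exp(-2.0 * l) for l in range(tau, tau + terms))


def integral_bound(tau: int) -> float:
    """exp(-2*tau) plus the integral of exp(-2x) over [tau, inf)."""
    return math.exp(-2.0 * tau) + 0.5 * math.exp(-2.0 * tau)


def union_bound(tau: int, d_f: int) -> float:
    """Bound 3 * d_f * exp(-2*tau) on the failure probability."""
    return 2.0 * d_f * integral_bound(tau)
```

The exact geometric sum is $e^{-2\tau}/(1 - e^{-2}) \approx 1.157\,e^{-2\tau}$, so the integral comparison is slightly loose but suffices.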

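Now fix any $\ell \ge \tau$ and $k \in \{1, \ldots, \tilde d_f\}$. By a union bound,

$$\mathcal{P}\left(x : p_x(k,\ell) \ge \gamma/2\right) \le \mathcal{P}\left(x : \hat p_x(k,\ell,\ell) \ge \gamma/4\right) + \mathcal{P}\left(x : \left|p_x(k,\ell) - \hat p_x(k,\ell,\ell)\right| > \gamma/4\right). \qquad (44)$$

By Lemma 42, on $H_\tau^{(i)} \cap H_\tau^{(iii)}(\xi)$,

$$\mathcal{P}\left(x : \left|p_x(k,\ell) - \hat p_x(k,\ell,\ell)\right| > \gamma/4\right) \le \mathcal{P}\left(x : \left|p_x(k,\ell) - \hat p_x(k,\ell,\ell)\right| > \xi\right) \le \exp\left\{-\xi^2 \tilde M(\ell)\right\}. \qquad (45)$$

Also, on $H_\tau^{(iv)}$, (43) implies

$$\mathcal{P}\left(x : \hat p_x(k,\ell,\ell) \ge \gamma/4\right) = \hat p_{\gamma/4}(k,\ell,\ell) \le \hat\Delta^{(k)}_\ell\left(W_1, W_2, V^\star_\ell\right) - \ell^{-1}. \qquad (46)$$

Combining (44) with (45) and (46) yields

$$\mathcal{P}\left(x : p_x(k,\ell) \ge \gamma/2\right) \le \hat\Delta^{(k)}_\ell\left(W_1, W_2, V^\star_\ell\right) - \ell^{-1} + \exp\left\{-\xi^2 \tilde M(\ell)\right\}. \qquad (47)$$

For $\tau \ge \tau^{(iv)}(\xi;\delta)$, $\exp\left\{-\xi^2 \tilde M(\ell)\right\} - \ell^{-1} \le -\exp\left\{-\gamma^2 \tilde M(\ell)/256\right\}$, so that (47) implies the first inequality of the lemma: namely (41).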

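For the second inequality (i.e., (42)), on $H_\tau^{(iv)}$, (43) implies we have

$$\hat\Delta^{(k)}_\ell\left(W_1, W_2, V^\star_\ell\right) \le \hat p_{\gamma/4}(k,\ell,\ell) + 3\ell^{-1}. \qquad (48)$$

Also, by Lemma 43 (with $\alpha = 1/2$, $\zeta = \gamma/4$, $\beta = \xi/\zeta \le 1 - \sqrt{\alpha}$), for $\tau \ge \tau^{(iv)}(\xi;\delta)$, on $H_\tau(\delta) \cap H_\tau^{(i)} \cap H_\tau^{(iii)}(\xi)$,

$$\hat p_{\gamma/4}(k,\ell,\ell) \le \mathcal{P}\left(x : p_x(k,\ell) \ge \gamma/8\right) + \exp\left\{-\xi^2 \tilde M(\ell)\right\}. \qquad (49)$$

Thus, combining (48) with (49) yields

$$\hat\Delta^{(k)}_\ell\left(W_1, W_2, V^\star_\ell\right) \le \mathcal{P}\left(x : p_x(k,\ell) \ge \gamma/8\right) + 3\ell^{-1} + \exp\left\{-\xi^2 \tilde M(\ell)\right\}.$$

For $\tau \ge \tau^{(iv)}(\xi;\delta)$, we have $\exp\left\{-\xi^2 \tilde M(\ell)\right\} \le \ell^{-1}$, which establishes (42). ■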

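For $n \in \mathbb{N}$ and $k \in \{1, \ldots, d+1\}$, define the set

$$\mathcal{U}_n^{(k)} = \left\{m_n + 1, \ldots, m_n + \left\lfloor n \Big/ \left(6 \cdot 2^k \hat\Delta^{(k)}_{m_n}\left(W_1, W_2, V^\star_{m_n}\right)\right) \right\rfloor\right\},$$

where $m_n = \lfloor n/3 \rfloor$; $\mathcal{U}_n^{(k)}$ represents the set of indices processed in the inner loop of Meta-Algorithm 1 for the specified value of $k$.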

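Lemma 45 There are $(f, \mathbb{C}, \mathcal{P}, \gamma)$-dependent constants $c_1^*, c_2^* \in (0,\infty)$ such that, for any $\varepsilon \in (0,1)$ and integer $n \ge c_1^* \ln(c_2^*/\varepsilon)$, on an event $H_n(\varepsilon)$ with

$$\mathbb{P}\left(H_n(\varepsilon)\right) \ge 1 - (3/4)\varepsilon, \qquad (50)$$

we have, for $V = V^\star_{m_n}$,

$$\forall k \in \{1, \ldots, \tilde d_f\},\quad \left|\left\{m \in \mathcal{U}_n^{(k)} : \hat\Delta^{(k)}_m\left(X_m, W_2, V\right) \ge \gamma\right\}\right| \le \left\lfloor n / \left(3 \cdot 2^k\right) \right\rfloor, \qquad (51)$$

$$\hat\Delta^{(\tilde d_f)}_{m_n}\left(W_1, W_2, V\right) \le \Delta_n^{(\gamma/8)}(\varepsilon) + 4 m_n^{-1}, \qquad (52)$$

and $\forall m \in \mathcal{U}_n^{(\tilde d_f)}$,

$$\hat\Delta^{(\tilde d_f)}_m\left(X_m, W_2, V\right) < \gamma \;\Rightarrow\; \hat\Gamma^{(\tilde d_f)}_m\left(X_m, -f(X_m), W_2, V\right) < \hat\Gamma^{(\tilde d_f)}_m\left(X_m, f(X_m), W_2, V\right). \qquad (53)$$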



Proof Suppose n > ciln(c2/e), where ci = max 



df +12 



2"/ 



24 24 



max 



4 (c» + c(-) + c(™) (7/16) + 6d7) , 4 (^) , 4 (- 



5f-y2 ' '-(1/16)' '-(l-7)/6' 



3r* > and C2 



4e 



-7)/6 



In particular, we have chosen $c_1^*$ and $c_2^*$ large enough so that

$$m_n \ge \max\left\{\tau(1/16; \varepsilon/2),\ \tau^{(iv)}(\gamma/16; \varepsilon/2),\ \tau((1-\gamma)/6; \varepsilon/2),\ \tau^*\right\}.$$

We begin with (51). By Lemmas 43 and 44, on the event

H^'He) = HmAe/2) n n ^(1^(7/16) n H^^ , 

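$$\hat p_\gamma(k, m_n, m) \le \mathcal{P}\left(x : p_x(k, m_n) \ge \gamma/2\right) + \exp\left\{-\gamma^2 \tilde M(m)/256\right\}$$

$$\le \mathcal{P}\left(x : p_x(k, m_n) \ge \gamma/2\right) + \exp\left\{-\gamma^2 \tilde M(m_n)/256\right\} \le \hat\Delta^{(k)}_{m_n}\left(W_1, W_2, V\right). \qquad (54)$$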
Recall that $\left\{X_m : m \in \mathcal{U}_n^{(k)}\right\}$ is a sample of size $\left\lfloor n / \left(6 \cdot 2^k \hat\Delta^{(k)}_{m_n}(W_1, W_2, V)\right) \right\rfloor$, conditionally i.i.d. (given $(W_1, W_2, V)$) with conditional distributions $\mathcal{P}$. Thus, $\forall k \in \{1, \ldots, \tilde d_f\}$, on $H_n^{(1)}(\varepsilon)$,



m 



Wi,W2,V 



< 



{m G Ui"'^ : AW (X^, W2, V) > j]\ > 2 \ujl^^\ (T^i, W2, V) 



Wi,W2,V 



<FiB(\ui%Al^l{W,,W2,V))>2Ui'^ AliiUWuW2,V) 



Wi,W2,V , 



(55) 



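where this last inequality follows from (54), and $B(u,p) \sim \mathrm{Binomial}(u,p)$ is independent of $W_1, W_2, V$ (for any fixed $u$ and $p$). By a Chernoff bound, (55) is at most

$$\exp\left\{-\left\lfloor n / \left(6 \cdot 2^k \hat\Delta^{(k)}_{m_n}(W_1, W_2, V)\right) \right\rfloor \hat\Delta^{(k)}_{m_n}(W_1, W_2, V) / 3\right\} \le \exp\left\{1 - n / \left(18 \cdot 2^k\right)\right\}.$$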
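The Chernoff step here has the form $\mathbb{P}(B(u,p) > 2up) \le \exp\{-up/3\}$, which is the standard multiplicative bound $\exp\{-\delta^2\mu/(2+\delta)\}$ at $\delta = 1$. As a hedged illustration (the helper names below are hypothetical and not from the paper), this Python sketch compares the exact binomial tail with that bound for a few parameter choices.

```python
import math


def binomial_upper_tail(u: int, p: float, threshold: float) -> float:
    """Exact P(B > threshold) for B ~ Binomial(u, p)."""
    start = math.floor(threshold) + 1
    return sum(
        math.comb(u, j) * p ** j * (1.0 - p) ** (u - j)
        for j in range(start, u + 1)
    )


def chernoff_bound(u: int, p: float) -> float:
    """Multiplicative Chernoff bound on P(B > 2*u*p): exp(-u*p/3)."""
    return math.exp(-u * p / 3.0)


def bound_holds(u: int, p: float) -> bool:
    """True if the exact tail is no larger than the Chernoff bound."""
    return binomial_upper_tail(u, p, 2.0 * u * p) <= chernoff_bound(u, p)
```

For instance, `bound_holds(200, 0.1)` compares $\mathbb{P}(B > 40)$ for $B \sim \mathrm{Binomial}(200, 0.1)$ against $e^{-20/3}$.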

" (2) 

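By the law of total probability and a union bound, there exists an event $H_n^{(2)}$ with

$$\mathbb{P}\left(H_n^{(1)}(\varepsilon) \setminus H_n^{(2)}\right) \le \tilde d_f \cdot \exp\left\{1 - n / \left(18 \cdot 2^{\tilde d_f}\right)\right\}$$

such that, on $H_n^{(1)}(\varepsilon) \cap H_n^{(2)}$, (51) holds.

Next, by Lemma 44, on $H_n^{(1)}(\varepsilon)$,

$$\hat\Delta^{(\tilde d_f)}_{m_n}\left(W_1, W_2, V\right) \le \mathcal{P}\left(x : p_x\left(\tilde d_f, m_n\right) \ge \gamma/8\right) + 4 m_n^{-1},$$

and by Lemma 38, on $H_n^{(1)}(\varepsilon)$, this is at most $\Delta_n^{(\gamma/8)}(\varepsilon) + 4 m_n^{-1}$, which establishes (52).

Finally, Lemma 41 implies that on $H_n^{(1)}(\varepsilon) \cap H_{m_n}^{(ii)}$, $\forall m \in \mathcal{U}_n^{(\tilde d_f)}$, (53) holds. Thus, defining

$$H_n(\varepsilon) = H_n^{(1)}(\varepsilon) \cap H_n^{(2)} \cap H_{m_n}^{(ii)},$$

it remains only to establish (50). By a union bound, we have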

1 - P (i^n) < (1 - P (^,n„(^/2))) + (1 - P + P \ 

+ p \ //^:)(7/i6)) + (1 - p + p [m^\e) \ m;^^ 

< e/2 + c(*) • exp |-M(m„)/4| + c^") • exp |-M(m„)^/V6o} 



+ c 



(in) 



(7/I6) • exp + 3df ■ exp{-2mn} 



+ J/-exp|l-n/(^18-2''/)| 

< e/2 + (c(*) + c(") + c(^*^)(7/16) + Gd/) • exp \^-n6fj^2^'^f^^^^ 
We have chosen $n$ large enough so that (56) is at most $(3/4)\varepsilon$, which establishes (50).

The following result is a slightly stronger version of Theorem 6.



(56) 
Lemma 46 For any passive learning algorithm $\mathcal{A}_p$, if $\mathcal{A}_p$ achieves a label complexity $\Lambda_p$ with $\infty > \Lambda_p(\varepsilon, f, \mathcal{P}) = \omega(\log(1/\varepsilon))$, then Meta-Algorithm 1, with $\mathcal{A}_p$ as its argument, achieves a label complexity $\Lambda_a$ such that $\Lambda_a(3\varepsilon, f, \mathcal{P}) = o(\Lambda_p(\varepsilon, f, \mathcal{P}))$. $\diamond$

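Proof Suppose $\mathcal{A}_p$ achieves label complexity $\Lambda_p$ with $\infty > \Lambda_p(\varepsilon, f, \mathcal{P}) = \omega(\log(1/\varepsilon))$. Let $\varepsilon \in (0,1)$, define $L(n;\varepsilon) = \left\lfloor n \Big/ \left(6 \cdot 2^{\tilde d_f}\left(\Delta_n^{(\gamma/8)}(\varepsilon) + 4 m_n^{-1}\right)\right) \right\rfloor$ (for any $n \in \mathbb{N}$), and let $L^{-1}(m;\varepsilon) = \max\{n \in \mathbb{N} : L(n;\varepsilon) < m\}$ (for any $m \in (0,\infty)$). Define

$$c_1 = \max\left\{c_1^*,\ 2 \cdot 6^3 (d+1) \tilde d_f \ln(e(d+1))\right\} \quad\text{and}\quad c_2 = \max\left\{c_2^*,\ 4e(d+1)\right\},$$

and suppose

$$n \ge \max\left\{c_1 \ln(c_2/\varepsilon),\ 1 + L^{-1}\left(\Lambda_p(\varepsilon, f, \mathcal{P}); \varepsilon\right)\right\}.$$

Consider running Meta-Algorithm 1 with $\mathcal{A}_p$ and $n$ as inputs, while $f$ is the target function and $\mathcal{P}$ is the data distribution.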

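Letting $\hat h_n$ denote the classifier returned from Meta-Algorithm 1, Lemma 34 implies that on an event $\hat E_n$ with $\mathbb{P}(\hat E_n) \ge 1 - e(d+1) \cdot \exp\left\{-\lfloor n/3 \rfloor / \left(72 \tilde d_f (d+1) \ln(e(d+1))\right)\right\} \ge 1 - \varepsilon/4$, we have

$$\mathrm{er}(\hat h_n) \le 2\,\mathrm{er}\left(\mathcal{A}_p\left(\mathcal{L}_{\tilde d_f}\right)\right).$$

By a union bound, the event $\hat G_n(\varepsilon) = \hat E_n \cap H_n(\varepsilon)$ has $\mathbb{P}\left(\hat G_n(\varepsilon)\right) \ge 1 - \varepsilon$. Thus,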



E 



er hn 



< E 



< E 



\C^J>Ap{eJ,V) 



er hn 



+ P (G„(e) n {\C^^ I < Ap(e, /,P)}) + P (G„(e)' 



2eT:(Ap[C^^ 



\C^J>Ap{eJ,r) 

+ F (dn{e) n [\C^^, \ < Ap{e, f,V)]) + e. (57) 

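On $\hat G_n(\varepsilon)$, (52) of Lemma 45 implies $\left|\mathcal{L}_{\tilde d_f}\right| \ge L(n;\varepsilon)$, and we chose $n$ large enough so that $L(n;\varepsilon) \ge \Lambda_p(\varepsilon, f, \mathcal{P})$. Thus, the second term in (57) is zero, and we have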



E 



er I h 



< 2 -E 
= 2 -E 



\Cj\>Apie,f,V) 



+ e 



E 



er ( ^„ f £ 



|£,-J>Ap(e,/,P) 



+ e. (58) 



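Note that for any $\ell$ with $\mathbb{P}\left(\left|\mathcal{L}_{\tilde d_f}\right| = \ell\right) > 0$, the conditional distribution of $\left\{X_m : m \in \mathcal{U}_n^{(\tilde d_f)}\right\}$ given $\left\{\left|\mathcal{L}_{\tilde d_f}\right| = \ell\right\}$ is simply the product $\mathcal{P}^\ell$ (i.e., conditionally i.i.d.), which is the same as the distribution of $\{X_1, X_2, \ldots, X_\ell\}$. Furthermore, on $\hat G_n(\varepsilon)$, (51) implies that the $t < \lfloor 2n/3 \rfloor$ condition is always satisfied in Step 6 of Meta-Algorithm 1 while $k \le \tilde d_f$, and (53) implies that the inferred labels from Step 8 for $k = \tilde d_f$ are all correct. Therefore, for any such $\ell$ with $\ell \ge \Lambda_p(\varepsilon, f, \mathcal{P})$, we have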



E 



eviApiC^ 



{\^dj\=^}] <^[eT{Ap{Ze))]<e. 
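In particular, this means (58) is at most $3\varepsilon$. This implies that Meta-Algorithm 1, with $\mathcal{A}_p$ as its argument, achieves a label complexity $\Lambda_a$ such that

$$\Lambda_a(3\varepsilon, f, \mathcal{P}) \le \max\left\{c_1 \ln(c_2/\varepsilon),\ 1 + L^{-1}\left(\Lambda_p(\varepsilon, f, \mathcal{P}); \varepsilon\right)\right\}.$$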



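Since $\Lambda_p(\varepsilon, f, \mathcal{P}) = \omega(\log(1/\varepsilon)) \Rightarrow c_1 \ln(c_2/\varepsilon) = o(\Lambda_p(\varepsilon, f, \mathcal{P}))$, it remains only to show that $L^{-1}(\Lambda_p(\varepsilon, f, \mathcal{P}); \varepsilon) = o(\Lambda_p(\varepsilon, f, \mathcal{P}))$. Note that $\forall \varepsilon \in (0,1)$, $L(1;\varepsilon) = 0$ and $L(n;\varepsilon)$ is diverging in $n$. Furthermore, by Lemma 38, we know that for any $\mathbb{N}$-valued $N(\varepsilon) = \omega(\log(1/\varepsilon))$, we have $\Delta_{N(\varepsilon)}^{(\gamma/8)}(\varepsilon) = o(1)$, which implies $L(N(\varepsilon);\varepsilon) = \omega(N(\varepsilon))$. Thus, since $\Lambda_p(\varepsilon, f, \mathcal{P}) = \omega(\log(1/\varepsilon))$, Lemma 31 implies $L^{-1}(\Lambda_p(\varepsilon, f, \mathcal{P}); \varepsilon) = o(\Lambda_p(\varepsilon, f, \mathcal{P}))$, as desired.

This establishes the result for an arbitrary $\gamma \in (0,1)$. To specialize to the specific procedure stated as Meta-Algorithm 1, we simply take $\gamma = 1/2$. ■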

Proof [Theorem 6] Theorem 6 now follows immediately from Lemma 46. Specifically, we have proven Lemma 46 for an arbitrary distribution $\mathcal{P}$ on $\mathcal{X}$, an arbitrary $f \in \mathrm{cl}(\mathbb{C})$, and an arbitrary passive algorithm $\mathcal{A}_p$. Therefore, it will certainly hold for every $\mathcal{P}$ and $f \in \mathbb{C}$, and since every $(f, \mathcal{P}) \in \mathrm{Nontrivial}(\Lambda_p)$ has $\infty > \Lambda_p(\varepsilon, f, \mathcal{P}) = \omega(\log(1/\varepsilon))$, the implication that Meta-Algorithm 1 activizes every passive algorithm $\mathcal{A}_p$ for $\mathbb{C}$ follows. ■

Careful examination of the proofs above reveals that the "3" in Lemma 46 can be set to any arbitrary constant strictly larger than 1, by an appropriate modification of the "7/12" threshold in ActiveSelect. In fact, if we were to replace Step 4 of ActiveSelect by instead selecting $\hat k = \operatorname{argmin}_k \max_{j \ne k} m_{kj}$ (where $m_{kj} = \mathrm{er}_{Q_{jk}}(h_k)$ when $k < j$), then we could even make this a certain $(1 + o(1))$ function of $\varepsilon$, at the expense of larger constant factors in $\Lambda_a$.

Appendix C. The Label Complexity of Meta-Algorithm 2

As mentioned, Theorem 10 is essentially implied by the details of the proof of Theorem 16 in Appendix D below. Here we present a proof of Theorem 13, along with two useful related lemmas. The first, Lemma 47, lower bounds the expected number of label requests Meta-Algorithm 2 would make while processing a given number of random unlabeled examples. The second, Lemma 48, bounds the amount by which each label request is expected to reduce the probability mass in the region of disagreement. Although we will only use Lemma 48 in our proof of Theorem 13, Lemma 47 may be of independent interest, as it provides additional insights into the behavior of disagreement based methods, as related to the disagreement coefficient, and is included for this reason.

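Throughout, we fix an arbitrary class $\mathbb{C}$, a target function $f \in \mathbb{C}$, and a distribution $\mathcal{P}$, and we continue using the notational conventions of the proofs above, such as $V^\star_m = \{h \in \mathbb{C} : \forall i \le m, h(X_i) = f(X_i)\}$ (with $V^\star_0 = \mathbb{C}$). Additionally, for $t \in \mathbb{N}$, define the random variable

$$M(t) = \min\left\{m \in \mathbb{N} : \sum_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}\left(V^\star_{\ell-1}\right)}(X_\ell) = t\right\},$$

which represents the index of the $t$-th unlabeled example Meta-Algorithm 2 would request the label of (assuming it has not yet halted).

The two aforementioned lemmas are formally stated as follows.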



Aa{3e,f,V) < max 



{ci \n{c2/e),l + L"'iAp{e,f,Vy,e\ 



)}■ 
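Lemma 47 For any $r \in (0,1)$,

$$\mathbb{E}\left[\sum_{m=1}^{\lceil 1/r \rceil} \mathbb{1}_{\mathrm{DIS}\left(V^\star_{m-1}\right)}(X_m)\right] \ge \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r)))}{2r}. \qquad \diamond$$

Lemma 48 For any $r \in (0,1)$ and $n \in \mathbb{N}$,

$$\mathbb{E}\left[\mathcal{P}\left(\mathrm{DIS}\left(V^\star_{M(n)}\right)\right)\right] \ge \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) - nr. \qquad \diamond$$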
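To make the quantity in Lemma 47 concrete, the following Python sketch simulates disagreement-based label requests for a toy problem that is not taken from the paper: threshold classifiers on $[0,1]$ under the uniform distribution, with the target threshold at $0.5$ (all of these choices, and the function names, are assumptions made for illustration). For this class $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) = \min\{2r, 1\}$, so the lemma's lower bound equals $\min\{2r,1\}/(2r) \le 1$. The bound is very loose here because the disagreement coefficient of thresholds is bounded; the example only shows what is being counted.

```python
import math
import random


def count_requests(r: float, rng: random.Random, target: float = 0.5) -> int:
    """Count label requests among the first ceil(1/r) examples.

    The version space of thresholds consistent with the labels seen so far
    is the interval (lo, hi]; a point x is in the region of disagreement
    when it falls in [lo, hi).
    """
    lo, hi = 0.0, 1.0
    requests = 0
    for _ in range(math.ceil(1.0 / r)):
        x = rng.random()
        if lo <= x < hi:           # x is in DIS(V): request its label
            requests += 1
            if x >= target:        # label +1 shrinks the interval from above
                hi = x
            else:                  # label -1 shrinks the interval from below
                lo = x
    return requests


def lemma47_lower_bound(r: float) -> float:
    """P(DIS(B(f, r))) / (2r) for the toy threshold problem."""
    return min(2.0 * r, 1.0) / (2.0 * r)


def mean_requests(r: float, runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected number of label requests."""
    rng = random.Random(seed)
    return sum(count_requests(r, rng) for _ in range(runs)) / runs
```

Calling `mean_requests(0.01, 200)` estimates the left-hand side of Lemma 47 for $r = 0.01$, to be compared with `lemma47_lower_bound(0.01)`.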



Before proving these lemmas, let us first mention their relevance to the disagreement coefficient analysis. Specifically, note that when $\theta_f(\varepsilon)$ is unbounded, there exist arbitrarily small values of $\varepsilon$ for which $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,\varepsilon)))/\varepsilon \approx \theta_f(\varepsilon)$, so that in particular $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,\varepsilon)))/\varepsilon \ne o(\theta_f(\varepsilon))$. Therefore, Lemma 47 implies that the number of label requests Meta-Algorithm 2 makes among the first $\lceil 1/\varepsilon \rceil$ unlabeled examples is $\ne o(\theta_f(\varepsilon))$ (assuming it does not halt first). Likewise, one implication of Lemma 48 is that arriving at a region of disagreement with expected probability mass less than $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,\varepsilon)))/2$ requires a budget $n$ of at least $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,\varepsilon)))/(2\varepsilon) \ne o(\theta_f(\varepsilon))$.

We now present proofs of Lemmas 47 and 48.
Proof [Lemma 47] Since



E 



[l/rl 
m=l 



^ E [p (Xm G DIS 

m=l 
[l/rl 

^E[P (DIS(F^„i))]' 

m=l 



m— 1 



(59) 



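we focus on lower bounding $\mathbb{E}\left[\mathcal{P}\left(\mathrm{DIS}\left(V^\star_m\right)\right)\right]$ for $m \in \mathbb{N} \cup \{0\}$. Let $D_m = \mathrm{DIS}\left(V^\star_m \cap \mathrm{B}(f,r)\right)$. Note that for any $x \in \mathrm{DIS}(\mathrm{B}(f,r))$, there exists some $h_x \in \mathrm{B}(f,r)$ with $h_x(x) \ne f(x)$, and if this $h_x \in V^\star_m$, then $x \in D_m$ as well. This means $\forall x$, $\mathbb{1}_{D_m}(x) \ge \mathbb{1}_{\mathrm{DIS}(\mathrm{B}(f,r))}(x) \cdot \mathbb{1}_{V^\star_m}(h_x) = \mathbb{1}_{\mathrm{DIS}(\mathrm{B}(f,r))}(x) \cdot \prod_{\ell=1}^{m} \mathbb{1}_{\mathrm{DIS}(\{h_x, f\})^c}(X_\ell)$. Therefore,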



E [V (DIS (y^))] = P (X^+1 G DIS (K;)) > P (X^+1 G 



E 



E 



> E 



E 



E 



Id™ (-'^m+l; 



X 



m+1 



(^X^+iC-'^^) = /(-'^£)p^m+l) lDIS(B{/,r))(^m+l) 



.£=1 



(60) 
(61) 



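$$\ge \mathbb{E}\left[(1-r)^m \mathbb{1}_{\mathrm{DIS}(\mathrm{B}(f,r))}(X_{m+1})\right] = (1-r)^m \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))),$$

where the equality in (60) is by conditional independence of the $\mathbb{1}_{\mathrm{DIS}(\{h_{X_{m+1}}, f\})^c}(X_\ell)$ indicators, given $X_{m+1}$, and the inequality in (61) is due to $h_{X_{m+1}} \in \mathrm{B}(f,r)$. This indicates (59) is at least

$$\sum_{m=1}^{\lceil 1/r \rceil} (1-r)^{m-1} \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) \ge \sum_{m=1}^{\lceil 1/r \rceil} \left(1 - (m-1)r\right) \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r)))$$

$$= \lceil 1/r \rceil \left(1 - \frac{\lceil 1/r \rceil - 1}{2}\, r\right) \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) \ge \frac{\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r)))}{2r}. \qquad ■$$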
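The final chain of inequalities in this proof reduces to the scalar claim $\sum_{m=1}^{\lceil 1/r \rceil} (1-r)^{m-1} \ge 1/(2r)$, since $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r)))$ is a common factor. The Python sketch below is an illustration with hypothetical function names; it evaluates both sides on a range of $r$ values as a sanity check.

```python
import math


def geometric_partial_sum(r: float) -> float:
    """Sum of (1 - r)^(m - 1) for m = 1, ..., ceil(1/r)."""
    n_terms = math.ceil(1.0 / r)
    return sum((1.0 - r) ** (m - 1) for m in range(1, n_terms + 1))


def linearized_sum(r: float) -> float:
    """Sum of 1 - (m - 1) * r over the same range (Bernoulli's inequality)."""
    n_terms = math.ceil(1.0 / r)
    return n_terms * (1.0 - (n_terms - 1) * r / 2.0)


def target(r: float) -> float:
    """Right-hand side 1 / (2r)."""
    return 1.0 / (2.0 * r)
```

The geometric sum is in fact at least $(1 - e^{-1})/r$, so the factor $1/2$ leaves some slack.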
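Proof [Lemma 48] For each $m \in \mathbb{N} \cup \{0\}$, let $D_m = \mathrm{DIS}\left(\mathrm{B}(f,r) \cap V^\star_m\right)$. For convenience, let $M(0) = 0$. We prove the result by induction. We clearly have $\mathbb{E}\left[\mathcal{P}\left(D_{M(0)}\right)\right] = \mathbb{E}\left[\mathcal{P}(D_0)\right] = \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r)))$, which serves as our base case. Now fix any $n \in \mathbb{N}$, and take as the inductive hypothesis that

$$\mathbb{E}\left[\mathcal{P}\left(D_{M(n-1)}\right)\right] \ge \mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) - (n-1)r.$$

As in the proof of Lemma 47, for any $x \in D_{M(n-1)}$, there exists $h_x \in \mathrm{B}(f,r) \cap V^\star_{M(n-1)}$ with $h_x(x) \ne f(x)$; unlike the proof of Lemma 47, $h_x$ here is a random variable, determined by $V^\star_{M(n-1)}$. If $h_x$ is also in $V^\star_{M(n)}$, then $x \in D_{M(n)}$ as well. Thus, $\forall x$, $\mathbb{1}_{D_{M(n)}}(x) \ge \mathbb{1}_{D_{M(n-1)}}(x) \cdot \mathbb{1}_{V^\star_{M(n)}}(h_x) = \mathbb{1}_{D_{M(n-1)}}(x) \cdot \mathbb{1}_{\mathrm{DIS}(\{h_x, f\})^c}\left(X_{M(n)}\right)$, where this last equality is due to the fact that every $m \in \{M(n-1)+1, \ldots, M(n)-1\}$ has $X_m \notin \mathrm{DIS}\left(V^\star_{m-1}\right)$, so that in particular $h_x(X_m) = f(X_m)$. Therefore, letting $X \sim \mathcal{P}$ be independent of the data $\mathcal{Z}$,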



E 



1i5m(„-i,(^) • ^ (hxiXMin)) = fiXuin)) 



X, ^M(n-l) 



(62) 



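The conditional distribution of $X_{M(n)}$ given $V^\star_{M(n-1)}$ is merely $\mathcal{P}$, but with support restricted to $\mathrm{DIS}\left(V^\star_{M(n-1)}\right)$ and renormalized to a probability measure. Thus, since any $x \in D_{M(n-1)}$ has $\mathrm{DIS}(\{h_x, f\}) \subseteq \mathrm{DIS}\left(V^\star_{M(n-1)}\right)$, we have

$$\mathbb{P}\left(h_x\left(X_{M(n)}\right) \ne f\left(X_{M(n)}\right) \,\middle|\, V^\star_{M(n-1)}\right) = \frac{\mathcal{P}(\mathrm{DIS}(\{h_x, f\}))}{\mathcal{P}\left(\mathrm{DIS}\left(V^\star_{M(n-1)}\right)\right)} \le \frac{r}{\mathcal{P}\left(D_{M(n-1)}\right)},$$

where the inequality follows from $h_x \in \mathrm{B}(f,r)$ and $D_{M(n-1)} \subseteq \mathrm{DIS}\left(V^\star_{M(n-1)}\right)$. Therefore, (62) is at least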



E 



^(^A/(n-l): 



E 
E 



X € D 
V (D 



M{n-1) 
A/(n-l)) 



D 



1 



n-1); 



ViDMin^l)), 

= E[P(i^A,(„„l))] 



r. 



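By the inductive hypothesis, this is at least $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f,r))) - nr$.

Finally, noting $\mathbb{E}\left[\mathcal{P}\left(\mathrm{DIS}\left(V^\star_{M(n)}\right)\right)\right] \ge \mathbb{E}\left[\mathcal{P}\left(D_{M(n)}\right)\right]$ completes the proof. ■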



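With Lemma 48 in hand, we are ready for the proof of Theorem 13.
Proof [Theorem 13] Let $\mathbb{C}$, $f$, $\mathcal{P}$, and $\lambda$ be as in the theorem statement. For $m \in \mathbb{N}$, let $\lambda^{-1}(m) = \inf\{\varepsilon > 0 : \lambda(\varepsilon) \le m\}$, or $1$ if this is not defined. We define $\mathcal{A}_p$ as a randomized algorithm such that, for $m \in \mathbb{N}$ and $\mathcal{L} \in (\mathcal{X} \times \{-1,+1\})^m$, $\mathcal{A}_p(\mathcal{L})$ returns $f$ with probability $1 - \lambda^{-1}(|\mathcal{L}|)$ and returns $-f$ with probability $\lambda^{-1}(|\mathcal{L}|)$ (independent of the contents of $\mathcal{L}$). Note that, for any integer $m \ge \lambda(\varepsilon)$, $\mathbb{E}\left[\mathrm{er}\left(\mathcal{A}_p(\mathcal{Z}_m)\right)\right] = \lambda^{-1}(m) \le \lambda^{-1}(\lambda(\varepsilon)) \le \varepsilon$. Therefore, $\mathcal{A}_p$ achieves some label complexity $\Lambda_p$ with $\Lambda_p(\varepsilon, f, \mathcal{P}) = \lambda(\varepsilon)$ for all $\varepsilon > 0$.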

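If $\theta_f\left(\lambda(\varepsilon)^{-1}\right) \ne \omega(1)$, then since every label complexity $\Lambda_a$ is $\Omega(1)$, the result clearly holds. Otherwise, suppose $\theta_f\left(\lambda(\varepsilon)^{-1}\right) = \omega(1)$, and take any sequence of values $\varepsilon_i \to 0$ for which each $i$ has $\varepsilon_i \in (0, 1/2)$, $\theta_f\left(\lambda(2\varepsilon_i)^{-1}\right) \ge 12$, and $2\varepsilon_i$ a continuity point of $\lambda$; this is possible, since $\lambda$ is monotone, and thus has only a countably infinite number of discontinuities. We have that $\theta_f\left(\lambda(2\varepsilon_i)^{-1}\right)$ diverges as $i \to \infty$, and thus so does $\lambda(2\varepsilon_i)$. This then implies that there exist values $r_i \to 0$ such that each $r_i > \lambda(2\varepsilon_i)^{-1}$ and $\mathcal{P}(\mathrm{DIS}(\mathrm{B}(f, r_i)))/r_i \ge \theta_f\left(\lambda(2\varepsilon_i)^{-1}\right)/2$.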

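Fix any $i \in \mathbb{N}$ and any $n \in \mathbb{N}$ with $n \le \theta_f\left(\lambda(2\varepsilon_i)^{-1}\right)/4$. Consider running Meta-Algorithm 2 with arguments $\mathcal{A}_p$ and $n$, and let $\hat{\mathcal{L}}$ denote the final value of the set $\mathcal{L}$, and let $\hat m$ denote the value of $m$ upon reaching Step 6. Since $2\varepsilon_i$ is a continuity point of $\lambda$, any $m < \lambda(2\varepsilon_i)$ and $\mathcal{L} \in (\mathcal{X} \times \{-1,+1\})^m$ has $\mathrm{er}\left(\mathcal{A}_p(\mathcal{L})\right) = \lambda^{-1}(m) > 2\varepsilon_i$. Therefore, we have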



E 



er ( ^p ( £ 



> 2eiF ( |£| < X{2ei 
n 



2£i 



A > 



6A(2e,0 



2e»] 

= 2ei ( 1 



n/(6AjJ < A(2e,; 
A < 



6A(2ei 



(63) 



Since n < 0j (A(2ei)"') /4 < P(DIS(B(/, ri)))/(2r0 < A(2ei)7'(DIS(B(/, ri)))/2, we have 
A < 



n 



6A(2ei 

< I 



< 



A<T'(DIS(B(/,ri)))/12 
(DIS {VI)) < P(DIS(B(/, r,)))/12} u[a<V (DIS (V^))} 



Since m < M( [n/2] ), monotonicity and a union bound imply this is at most 



P DIS V, 



A/(rn/2l) 

Markov's inequality implies 



< P(DIS(B(/, r,)))/12 +f(A<V (DIS (Vt^)) 



(64) 



(65) 



DIS V, 



M(ln/2l 



E 



< 



<P(DIS(B(/,ri)))/12 
P(DIS(B(/, r,))) - V (dIS (^;,(r„/2i) 
P(DIS(B(/,r,)))-P(DIS 



>l^P(DIS(B(/,r,))) 



iip(DIS(B(/,rO)) 




E 



P(DIS(B(/,r,))) 



Lemma |48] implies this is at most jf p(dis(^b(/V ))) — if 



P(DIS(B(/,n))) 
4ri 



7'(DIS(B(/,ri))) 



. Since 



any a > 3/2 has [a] < (3/2)a, and 6*/ (A(2e,)^^) > 12 implies ^(Disg(/,n))) > 3^2, we have 

< ^. Combining 



P(DIS(B(/,r.))) 
4r-i 

the above, we have 



P(DIS(B(/,r,))) 
4ri 



P(DIS(B(/,r,))) - 22 



P (DIS ( 



<P(DIS(B(/,ri)))/12) <-. 



(66) 
Examining the second term in (65), Hoeffding's inequality and the definition of $\hat\Delta$ from (14) imply



A<P(DIS(K?))j =E 
Combining ( [63l) through ( [671 ) implies 

E er f £ 



A<P(DIS(y,?)) 



< 



E [e^®™] < e~*^ < 1/11. 

(67) 



> 2e, 1 



_9_ J_ 
22 ~ IT 



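Thus, for any label complexity $\Lambda_a$ achieved by running Meta-Algorithm 2 with $\mathcal{A}_p$ as its argument, we must have $\Lambda_a(\varepsilon_i, f, \mathcal{P}) > \theta_f\left(\lambda(2\varepsilon_i)^{-1}\right)/4$. Since this is true for all $i \in \mathbb{N}$, and $\varepsilon_i \to 0$ as $i \to \infty$, this establishes the result. ■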



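Appendix D. The Label Complexity of Meta-Algorithm 3

As in Appendix B, we will assume $\mathbb{C}$ is a fixed VC class, $\mathcal{P}$ is some arbitrary distribution, and $f \in \mathrm{cl}(\mathbb{C})$ is an arbitrary fixed function. We continue using the notation introduced above: in particular, $\mathcal{S}^k(\mathcal{H}) = \{S \in \mathcal{X}^k : \mathcal{H} \text{ shatters } S\}$, $\bar{\mathcal{S}}^k(\mathcal{H}) = \mathcal{X}^k \setminus \mathcal{S}^k(\mathcal{H})$, $\bar\partial^k_{\mathcal{H}} f = \mathcal{X}^k \setminus \partial^k_{\mathcal{H}} f$, and $\tilde\delta_f = \mathcal{P}^{\tilde d_f - 1}\left(\partial_{\mathbb{C}}^{\tilde d_f - 1} f\right)$. Also, as above, we will prove a more general result replacing the "1/2" in Steps 5, 9, and 12 of Meta-Algorithm 3 with an arbitrary value $\gamma \in (0,1)$; thus, the specific result for the stated algorithm will be obtained by taking $\gamma = 1/2$.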

For the estimators $\hat P_m$ in Meta-Algorithm 3, we take precisely the same definitions as given in Appendix B.1 for the estimators in Meta-Algorithm 1. In particular, the quantities $\hat\Delta^{(k)}_m(x, W_2, \mathcal{H})$, $\hat\Delta^{(k)}_m(W_1, W_2, \mathcal{H})$, $\hat\Gamma^{(k)}_m(x, y, W_2, \mathcal{H})$, and $M^{(k)}_m(\mathcal{H})$ are all defined as in Appendix B.1, and the $\hat P_m$ estimators are again defined as in (12), (13) and (14).

Also, we sometimes refer to quantities defined above, such as £, m) (defined in (l35l)). as 
well as the various events from the lemmas of the previous appendix, such as Ht-{6), H', Ht\ 



D.1 Proof of Theorem 16

Throughout the proof, we will make reference to the sets $V_m$ defined in Meta-Algorithm 3. Also let $V^{(k)}$ denote the final value of $V$ obtained for the specified value of $k$ in Meta-Algorithm 3. Both $V_m$ and $V^{(k)}$ are implicitly functions of the budget, $n$, given to Meta-Algorithm 3. As above, we continue to denote by $V^\star_m = \{h \in \mathbb{C} : \forall i \le m, h(X_i) = f(X_i)\}$. One important fact we will use repeatedly below is that if $V_m = V^\star_m$ for some $m$, then since Lemma 35 implies that $V^\star_m \ne \emptyset$ on $H'$, we must have that all of the previous $\hat y$ values were consistent with $f$, which means that $\forall \ell \le m$, $V_\ell = V^\star_\ell$. In particular, if $V^{(k')} = V^\star_m$ for the largest $m$ value obtained while $k = k'$ in Meta-Algorithm 3, then $V_\ell = V^\star_\ell$ for all $\ell$ obtained while $k \le k'$ in Meta-Algorithm 3.

Additionally, define $\tilde m_n = \lfloor n/24 \rfloor$, and note that the value $m = \lceil n/6 \rceil$ is obtained while $k = 1$ in Meta-Algorithm 3. We also define the following quantities, which we will show are typically equal to related quantities in Meta-Algorithm 3. Define $\hat m_0 = 0$, $T^\star_0 = \lceil 2n/3 \rceil$, and $\hat t_0 = 0$, and for each $k \in \{1, \ldots, d+1\}$, inductively define
Itnk = l[7,oo) (X™, VF2,^^_i)) ,V?n G N,



mfc = mm < 



Uk = (mfc_i,mfc] nN 
Uk = {mk,rhk] n N, 



Cmk — l[0,[3r*/4j~ 



m > ihk-i : Yl = \Ttl^^ ) U {max {A: • 2" + 1, mk-i}] 




^k / J ^mk ^mk' 

and ik = Ql+ ^^rnk- 

The meaning of these values can be understood in the context of Meta- Algorithm 3, under the 
condition that Vm = for values of m obtained for the respective value of k. Specifically, under 
this condition, corresponds to Tk, tk represents the final value t for round k, rfik represents the
value of m upon reaching Step 9 in round k, while rhk represents the value of m at the end of round 
k, Uk corresponds to the set of indices arrived at in Step 4 during round k, while Uk corresponds to 
the set of indices arrived at in Step 11 during round k, for m ^Uk, I^f, indicates whether the label 
of Xm is requested, while for m G Uk, I^k ' ^mk indicates whether the label of is requested. 
Finally Q\ corresponds to the number of label requests in Step 13 during round k. In particular, 
note fhi > fhn. 

(i) 

Lemma 49 For any r G N, on the event H r\Gr , V/c, m G N with k < df, \/x £ X, for any 
sets % and %' with V^* C H C "H' C B(/, ri/g), if either k = 1 or m > t, then 

A(^) ix,W2,n) < (3/2) A(^^) {x,W2,'H') . 

In particular, for any 6 G (0, 1) and r > r(l/6; 6), on H' n Hr{d) n Gr \ V/c, e,£',m G N with 
m>T,e>e'>T,andk< df, Vx G X, A^^^ {x, W2, V/) < (3/2)aJ^^ (x, W2, V^l). o 

Proof First note that Vm G N, Vx G , 

A« ix,W2,n) = 1dis{«)(^) < 1dis(«')(^) = {x,W2,'H') , 

so the result holds for /c = 1. Lemma [35l Lemma l40l and monotonicity of Mm\-) imply that on 

H' n G^r\ for any m > r and /c G |2, . . . , d/j, 

Mi^) (n) > Y ht'f (^f - (2/3)mW (B(/,ri/6)) > (2/3)M(f) {n') , 



i=l 
so that G X,

i=l ^ 

<mw (Hri^V(^,) (5f u{x}) 
1=1 

< (3/2)mW (?^')"'El5H«') (^f U{x}) = (3/2) AW (x,Ty2,^') . 

i=l 

The final claim follows from Lemma l29l 



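Lemma 50 For any $k \in \{1, \ldots, d+1\}$, if $n \ge 3 \cdot 4^{k-1}$, then $T^\star_k \ge 4^{1-k}(2n/3)$ and $\hat t_k \le \left\lfloor 3T^\star_k/4 \right\rfloor$.

Proof Recall $T^\star_1 = \lceil 2n/3 \rceil \ge 2n/3$. If $n \ge 2$, we also have $\left\lfloor 3T^\star_1/4 \right\rfloor \ge \left\lceil T^\star_1/4 \right\rceil$, so that (due to the $C^\star_{m1}$ factors) $\hat t_1 \le \left\lfloor 3T^\star_1/4 \right\rfloor$. For the purpose of induction, suppose some $k \in \{2, \ldots, d+1\}$ has $n \ge 3 \cdot 4^{k-1}$, $T^\star_{k-1} \ge 4^{2-k}(2n/3)$, and $\hat t_{k-1} \le \left\lfloor 3T^\star_{k-1}/4 \right\rfloor$. Then $T^\star_k = T^\star_{k-1} - \hat t_{k-1} \ge T^\star_{k-1}/4 \ge 4^{1-k}(2n/3)$, and since $n \ge 3 \cdot 4^{k-1}$, we also have $\left\lfloor 3T^\star_k/4 \right\rfloor \ge \left\lceil T^\star_k/4 \right\rceil$, so that $\hat t_k \le \left\lfloor 3T^\star_k/4 \right\rfloor$ (again, due to the $C^\star_{mk}$ factors). Thus, by the principle of induction, this holds for all $k \in \{1, \ldots, d+1\}$ with $n \ge 3 \cdot 4^{k-1}$. ■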
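The induction in Lemma 50 can be exercised on its worst case, in which every round spends the full cap $\lfloor 3T/4 \rfloor$ of its remaining budget. The Python sketch below makes exactly that assumption (the function names are hypothetical, and the actual $\hat t_k$ values in the paper may be smaller than the cap). It checks, in integer arithmetic, that the remaining budget satisfies $T_k \ge 4^{1-k}(2n/3)$ and that $\lfloor 3T_k/4 \rfloor \ge \lceil T_k/4 \rceil$ whenever $n \ge 3 \cdot 4^{k-1}$.

```python
def worst_case_budgets(n: int, rounds: int) -> list:
    """Budgets T_1, ..., T_rounds when each round spends floor(3T/4)."""
    budgets = []
    t_star = (2 * n + 2) // 3            # ceil(2n/3)
    for _ in range(rounds):
        budgets.append(t_star)
        t_star -= (3 * t_star) // 4      # spend the full cap this round
    return budgets


def satisfies_lemma50(n: int, rounds: int) -> bool:
    """Check both conclusions of the lemma for every admissible round k."""
    for k, t_star in enumerate(worst_case_budgets(n, rounds), start=1):
        if n < 3 * 4 ** (k - 1):
            break                        # hypothesis of the lemma fails
        # T_k >= 4^(1-k) * (2n/3)  <=>  3 * 4^(k-1) * T_k >= 2n
        if 3 * 4 ** (k - 1) * t_star < 2 * n:
            return False
        # floor(3T/4) >= ceil(T/4)
        if (3 * t_star) // 4 < -(-t_star // 4):
            return False
    return True
```

Since spending the cap leaves $\lceil T/4 \rceil \ge T/4$ of the budget, any smaller spend only makes the remaining budgets larger.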

The next lemma indicates that the "$t < \lfloor 3T_k/4 \rfloor$" constraint in Step 12 is redundant for $k \le \tilde d_f$. It is similar to (51) in Lemma 45, but is made only slightly more complicated by the fact that the $\hat\Delta^{(k)}$ estimate is calculated in Step 9 based on a set $V_m$ different from the ones used to decide whether or not to request a label in Step 12.

Lemma 51 There exist (C, "P, /, ^)-dependent constants cf^ , G [1, oo) such that, for any 6 G 
(0, 1), and any integer n > c[*^ In (c^^/dj, on an event 



with P (<^)) > 1 - 25, VA; G { 1 , . . . , J/ }, ik = E I^^k < 3r*/4. 



Proof Define the constants 



max l-i^, ^-4^] ' cf = max 1^^, ( c« + c(-*)(7/16) + 125<i/^^ 

\r(3/32)' 5,7^ /' 2 lr(3/32)' V ^ ^ ^ / 



and let n(*^((5) = c^*^ In yc^^ /5j . Fix any integer n > n^^\6) and consider the event 
By Lemma 49 and the fact that > m„ for all k > 1, since n > n^^\6) > 24t{1/6;5), on
M^\<5),VfcG |l,...,J/},VmG4, 



A« (X^,T^2,C-i) < (3/2)A« (X^,VF2,y4) 



(68) 



Now fix any k G Since n > n^'\6) > 27 • Lemma [50] implies > 18, 

which means that 3T*/4 - \T*/A\ > AT*/9. Also note that < \T*/4\. Let iV, = 



(4/9)T^. Thus, we have 



; note that 



, so that Nk < 



m=mj._i+l 



(69) 



where this last inequality is by (|68] ). To simplify notation, define = {T^,mk-,Wi^W2,V^^^. 
By Lemmas |43] and 111] (with P = 3/32, C = 27/3, a = 3/4, and ^ = 7/I6), since n > nW((5) > 
24 • max {r(™)(7/16; -5), r(3/32; 5)}, on Vm G 4, 

P2^/3{k,rhk,m) <V {x : {k,mk) > 7/2) + exp |-7^M(m)/256| 
< P (x : {k,mk) > 7/2) + exp ^-j^M{7hk)/256j 
<Ag(TVi, 1^2,^4). 

Letting (^^(A:) denote the event thatp27/3(fc, 'rhk,m) < A^^ (VTi, W2, 1/? J, we see that G"„(A;) 5 

Hn\5)- Thus, since the 1(27/3,00) ^'^m'* (-'^m, W^2, Kftfe)) variables are conditionally independent 

given Zfc for m G Z^fc, each with respective conditional distribution Bernoulli (P27/3 (^i "^fci "i))> 
the law of total probability and a Chernoff bound imply that ( [69] ) is at most 



G'n{k) { ^ 1 [27/3,00) [^m 



E 



< E 



1 [27/3,00) [^m 

meWfe 

exp{-Ag (1^1, 1^2, VI,) \Uk\ /27}] < E [exp{-r,Vl62}] < exp{-n/ (243 • 4^-^) }, 
where the last inequality is by Lemma 50. Thus, there exists Gn{k) with P \^Hn\6) \ Gn{k)j <
exp{-n/ (243 • such that, on Hi^\6) n Gn{k), we have Emirn,_i+i ^mfc < 

Defining M\6) = H^}\5) n flfcii Gn{k), a union bound implies 

P (H^r^\5) \ H^:\6)) < df ■ exp [-n/ (243 • A^f-^) } , (70) 

and on H^\5), every A; G |l, . . . , J/j has ^^trhk-i+i ^mk - ^^fc/^- particular, this means 

the C;^^ factors are redundant in Q*, so that 4 = ^mfc- 
To get the stated probability bound, a union bound implies that 

1 - P < (1 - P (H^A^))) + (l - P (f« )) + P \ ^£^^^7/16)) 

< (5 + c(*) •exp|-M(m„)/4} 

+ c(™)(7/16) • exp|-M(m„)7V256| + 3(i/ • exp {-2?n„} 
+ 121(J/5~^ • exp |-M (m„) /6o} 

< 5 + (c(*) + c(*") (7/I6) + 124^/(5^^) • exp | -rh J f-^^ /512^ . (71) 
Since $n \ge n^{(i)}(\delta) \ge 24$, we have $\tilde m_n \ge n/48$, so that summing (70) and (71) gives us

1 - P (^H^'\S)^ <6 + (c(*) + c(™)(7/16) + 125J/5y^) • exp {-n5/7V ^512 • 48 • 4'^'/"^) } . 

(72) 

Finally, note that we have chosen $n^{(i)}(\delta)$ sufficiently large so that (72) is at most $2\delta$. ■

The next lemma indicates that the redundancy of the "$t < \lfloor 3T_k/4 \rfloor$" constraint, just established in Lemma 51, implies that all $\hat y$ labels obtained while $k \le \tilde d_f$ are consistent with the target function.

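Lemma 52 Consider running Meta-Algorithm 3 with a budget $n \in \mathbb{N}$, while $f$ is the target function and $\mathcal{P}$ is the data distribution. There is an event $H_n^{(ii)}$ and $(\mathbb{C}, \mathcal{P}, f, \gamma)$-dependent constants $c_1^{(ii)}, c_2^{(ii)} \in [1, \infty)$ such that, for any $\delta \in (0,1)$, if $n \ge c_1^{(ii)} \ln\left(c_2^{(ii)}/\delta\right)$, then $\mathbb{P}\left(H_n^{(i)}(\delta) \setminus H_n^{(ii)}\right) \le \delta$, and on $H_n^{(i)}(\delta) \cap H_n^{(ii)}$, we have $V^{(\tilde d_f)} = V_{\hat m_{\tilde d_f}} = V^\star_{\hat m_{\tilde d_f}}$. $\diamond$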
Proof Define cf) = max | \ ^52ii_ 2" \ ^ max c(-), exp {r*}|, let 

1 \ 1 ' '■(i-7)/6' 5)73 J ' 2 \2 'r(i_^)/6' ' L J J. 

= c[**^ In (c2*V'^) ' suppose n > n("^((5), and define the event ^i"^ = 

By LemmagB since n > n'-''\5) > 24-max {t((1 - 7)/6; .5), r*}, on M^(J)n#i"\ Vm G N 
and k £ 1 1, . . . , with either A; = 1 or m > rhn, 

A^^^ {X^, W2, < 7 ^ f {X^, -f{X^),W2, < f {Xm, f{Xm),W2, F^.i) • 

(73) 
Recall that rhn < min{[Ti/4] ,2"'} = [[2n/3] /4]. Therefore, Vm^ is obtained purely by rhn
executions of Step 8 while k = 1. Thus, for every m obtained in Meta-Algorithm 3, either k = 1 
or m > rhn. We now proceed by induction on m. We already know Vq = C = Vq, so this serves 
as our base case. Now consider some value m G N obtained in Meta-Algorithm 3 while k < df, 
and suppose every m' < m has Vm' = V^i ■ But this means that Tk = and the value of t upon 
obtaining this particular m has t < YI^=ihk^i+i ^tk- I" particular, if A^^ (X^, W2, Kn-i) > 7, 
then = 1, so that t < TI=rn,^,+i by LemmaEB on F«(<5) n H^"^ , E"l„'.,_,+i I^ak < 
Et'mfe_i+i^mfc < SO that t < 3TfcV4, and therefore y = Y„, = f{X„,y, this implies 

Vm = Vra- On the other hand, on H^n^ {S) n if A^^^ (X^, W2, Kn-i) < 7, then dH implies 

y = argmax f^^) (X^, y, V;„-i) = /(X^), 

!ye{-i,+i} 

so that again Vm = 1^^- Thus, by the principle of induction, on Hn\6) n Hn^\ for every m G N 
obtained while /c < df, we have = Vm' particular, this implies = Vm- = V^. . The 

bound on P (^M^ (S) \ M"-*) then follows from LemmagTl as we have chosen n^^^\5) sufficiently 
large so that (l28l ) (with r = rhn) is at most 6. ■ 



Lemma 53 Consider running Meta-Algorithm 3 with a budget n G N, while f is the target function 
and V is the data distribution. There exist {C,V, f,j) -dependent constants c[***'' , G [l,oo) 
such that, for any 6 G (0,e~'^), A G [l,oo), and n G N, there is an event Hn^^\6,X) with 
[h^^ ((5) n H^J:^ \ ^i"*^ ((5, A)) < 5 with the property that, if 

ri>cf^~ef{d/\Wr-^\, 



then on Hn^ (S) H Hn'^' n Hn'^'^' {5, A), at the conclusion of Meta-Algorithm 3, 



> A. 



Proof Let c 



(Hi) 
1 



max < Ci , c 



M) Mi) d-dfA'°+'"'f 19M \ Mii) 



1 '"-l ' 



-y^S'j ' '■{3/32) 



, Cn 



max 



1^2 >C2 >,.(3/32j tlx 



any 5 G (0,e-3), A G [l,oo), let n(*") (5, A) = cf''^ef{d/X)W{c^^''h/6), and suppose n > 
n(™)((5,A). 



Define a sequence ii = 2* for integers i > 0, and let t = log2 ( 4^"'"'^^ A/75j 



. Also define 

(^(m, 6, A) = max {0 (m; (^/2t) , d/A}, where (p is defined in Lemma |29l Then define the events 

i 

H^'\5, A) = fl H,^ {5/21) , M''\6, A) = H(^\6, A) n [ih^^, > it] . 
1=1 

Note that i < n, so that < 2", and therefore the truncation in the definition of "ij^> which 

enforces mj^ < max • 2" + 1, mfc„i|, will never be a factor in whether or not fh^^ > £1 is 
satisfied. 
Since n > (A, 6) > cf> In [cf'/6j , Lemma [52] implies that on H^'' (6) n H^' , Vrh^ = 

_ . Recall that this implies that all y values obtained while m < are consistent with their 

dj s 

respective f{Xm) values, so that every such m has Vm = well. In particular, Vm- = V^. . 

df dj 

Also note that n(^'*)((5, A) > 24 • r(™)(7/16; 5), so that r(*'')(7/16; 5) < rhn, and recall we always 
have rhn < rfij^,. Thus, on M\s) n M"-* n A), (taking A^^) as in Meta-Algorithm 3) 

^(df) ^ ^idf) (^vFi, Ws, ] (LemmalSH) 

(Lemma mi) 



(Markov's ineq.) 



< 

< 

< 
< 
< 
< 



+ 4m- 



Sh~5f)v^f (S^f (b (z,,^ (4, 5, A)))) +4^1 
%h~5f)~ef{d/\)Hlu5,\) +Uf 

l2h~5f)~9f{d/X)Hlt,5,\) 
UOfid/X) [^dln(2emax{4',4 /d)+ln{At/6) 



max < 2 



7(5/ 

Plugging in the definition of t and 

d In (2e max d} /d) + In (4^/5) 



d/X 



(Lemma t 

(defnof A)) 
(Lemma |29l) 
(defnof 6i/(d/A)) 

(74) 



< {d/X)-f6fA-'^^'^f In 4i+'^/A/575j < (d/A) In (A/5) 



Therefore, dTU is at most 2^6 f{d/ X){d/ X) In (A/5) /-i~5f. Thus, since 



n(***) (5, A) > max <| ^ In ( c^i' /5 ) , cf '^ In ( c^V^ ) \ , 



-(ii) 



Lemmas in] and [52] imply that on H^n\5) n i/^^ n i?r*^((5, A), 



r(««) n «-(«")( 



> 



> 



4i^'^/2?i/ (^gA^*^/) 



•■2A-ef{d/X){d/X)\n[X/5) 



> Aln(A/(5) > A. 



Now we turn to bounding P ( H^P (5) n M''^ \ H^^ (5, A) ) . By a union bound, we have 



1 - P (^(3) (5, A)) < J] (1 - P (i?,, (5/2^))) < 5/2. 



(75) 



j=i 
Now con- 



Thus, it remains only to bound P (^Hi'^ {5) D H^l^ n H^^'^ {5, A) n jmj^ < ^^}) . 

For each i G {0, 1, . . . , t - 1}, let Qi = |m G {(.i,ii+i]<rMA^^ : = l| 
sider the set X of alH G {0, 1, . . . , i - 1} with ii > rhn and (^j, ^j+i] n U^^, / 0. Note that 
n(***)((5,A) > 48, so that 4 < rhn- Fix any i G I. Since n(***)(A,(5) > 24 • r(l/6;(5), we 
have rhn > t{1/6;6), so that Lemma |49] implies that on Hi'\6) n H^^ n lettj^g 
g = 2 • (d/j^6j) Of{d/X) ln{X/6), 



P (Hii\6) n n X) n {ft > Q} 



< 



m 



> Q 



W2,V;A. (76) 



For m > ii, the variables 1(27/3,00) ( '^m^^ {Xm, W2, V^*) ) are conditionally (given W2, V^) in 



dependent, each with respective conditional distribution Bernoulli with mean P27/3 [df,ii,m 
Since n(*")((5, A) > 24 • t(3/32; 6), we have m„ > t(3/32; 5), so that Lemma |43] (with C = 27/3, 
a = 3/4, and /3 = 3/32) imphes that on {6) n M'^ n H^^^ {6, A), each of these m values has 



P27/3 {df,i„m^ <v{x: (df,i,^ > 7/2) + exp {-Af(m)7V256} 



< 



+ 



exp|-M(£i)7V256} 



< (2/7^/) P-^'/ (5'^'/ (y,:)) +exp{-M(^,)7V256} 

< (2/7^/) V'f (^S'^f (b (/,^(^„<5,A)))) +exp{-M(^,)7V256} 

< (2/7^/) ef{d/X)^{i^, 5, A) + exp {-M(£,)7V256} 



(Markov's ineq.) 



(Lemma L 
(Lemma [ 
(defnof 6i/(d/A)). 



Denote the expression in this last line by pi, and let B(^j,pj) be a Binomial random vari- 
able. Noting that ii+i - ii = ii, we have that on H^\6) n M""* n ^(^^((5, A), ([761) is at most 
P {B{ii,pi) > Q). Next, note that 



iiPi 



{2M f )e f{d/X)ii^{i^, 6, X)+i^-exp{-i^6f-fy 512} . 



Since u ■ exp {—u^} < (3e) for any u, letting u = ii5fj/8 we have 



ii • exp |-4^^/7V512} < (8/7(5/) u • exp {-n^} < 8/ (7(5/(3e)^/^) < 4/7(5/. 
Therefore, since ^{£i,5, A) > ^, we have that iiPi is at most 
-^ef{d/X)ei<P{ei,6,X) < 4-^j(d/A)max|2(iln(2e4-) + 21n ) J^d/X 

< -^ef{d/X)max { 2dln . + 2hi ^— , ^ 

70/ I \ 7"/ / \ ^OfO J ^Of 

< -^Of{d/X) max <^ 4dln , ^ ^ 

7^/ ' [V ^^f^ J ^^f J 

6 d4^+'^> , /A\ 46+"^'/^-.,,.., /A^ 



Therefore, a Chemoff bound implies P(B(^j,pj) > Q) < exp {-Q/6} < (5/2^ so that on 
Hi:!\S) n n H^^^6, A), (|76ll is at most 5/21. The law of total probabihty implies there exists 
an event H^^^ {i, 6, A) with P (^M^ (6) n M'^ n #(3) x) \ H^f^ {i, 6, A)) < 6/21 such that, on 

H^\6) n M"^ n H(^\6, A) n A), ft < Q. 

Note that 



lQ < log2 (42+'^a/7(^/j • A^+'^f [d/-f^6}j ef{d/X)ln{X/S) 

< (df4^+'^f /-/^6}^ d9f{d/X) In^ (A/(5) < A^^^fn/12. (77) 



Since Em<2m„ Cd> - '^/^^' if = 1 then dTTll implies that on n M^'^ n H^^\6, A) n 

aexM'^(i,<^,A), Er^^<£,^mi < + E.6x4 < n/12 + < n/6 < \T*/4], so that 

^1 > ^i- Otherwise, if d/- > 1, then every m G Uj^^ has m > 2?7i„, so that Ei<t = ^^Zt^xQi'^ 
thus, on Hi^\s)nHi''^r\H^^\6, A)nnie2 Hi^Hi, <5, A), Eiex Qi < tQ < 4^-^fn/12; Lemma[50] 
implies 4^~'^^n/12 < /4 , so that again we have Thus, a union bound implies 

P n ^(^^) n H^'H5, A) n {m^-^ < it}) 

< P n iff) n A) \ p ^(4)(,^ 5^ 

<^F (5) n ^(^^) n ^(3) ^) \ ^04) (i, 5, A)) < 5/2. (78) 
Therefore, P (M^ (6) n M"-" \ Hn''^ {5, A)) < 5, obtained by summing ([78]) and ([75]). ■ 



Proof [Theorem [T6l If Ap(e/4, /, =00 then the result trivially holds. Otherwise, suppose e G 

(0, 10e^3), let 5 = e/10, A = Ap(e/4,/,P), C2 = max |lo4*\ lOcf lOcf *\ 10e(d + 1)}, and 

ci = max |c^*\ c^"), c(^*"\ 2 • 6^((i + l)(Jln(e(d + and consider running Meta-Algorithm 
3 with passive algorithm Ap and budget n > ci^j((i/A) ln^(c2A/e), while / is the target func- 
tion and V is the data distribution. On the event Hn\6) n Hn^^ n Hn"\5-, A), Lemma [53] im- 



plies 
that K 



> A, while Lemma [52l implies F^*^/) = T/? ; recalling that Lemma [35] implies 



7^ on this event, we must have er^. (/) = 0. Furthermore, if h is the classifier 
returned by Meta-Algorithm 3, then Lemma [34| implies that er(/i) is at most 2ei{Ap{C^^)), on 

a high probability event (call it E2 in this context). Letting E^{5) = E2 ^ Hn\6) n Hn^'' n 
Hn^^\s, X), the total failure probability 1 — ¥{Es{6)) from all of these events is at most 4(5 + 
e(d + 1) • exp|-[n/3j/ (^72(i/(cf + l)ln(e(d + < 56 = e/2. Since, for ^ G N with 

P ^ C^^ ~ ^) ^ ^' sequence of Xm values appearing in are conditionally distributed as 
given \Ci \ = i, and this is the same as the (unconditional) distribution of {Xi,X2, . . . , Xi}, 



we have that 



E 



er ( h 



< E 



2ev Ar, Cj 



+ e/2 = E 



E 



2ej: A„ 



1 



\c,. 



+ e/2 



< 2 sup E [er(^p {Zi))] + e/2 < s. 

e>Ap{e/4J,r) 



To specialize to the specific variant of Meta-Algorithm 3 stated in Section 5.2, take $\gamma = 1/2$.



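Appendix E. Proofs Related to Section 6: Agnostic Learning

E.1 Proof of Theorem 22: Negative Result for Agnostic Activized Learning

It suffices to show that $\mathcal{A}_p$ achieves a label complexity $\Lambda_p$ such that, for any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ such that $\mathcal{P}_{XY} \in \mathrm{Nontrivial}(\Lambda_p; \mathbb{C})$ and yet $\Lambda_a(\nu + c\varepsilon, \mathcal{P}_{XY}) \neq o\left(\Lambda_p(\nu + \varepsilon, \mathcal{P}_{XY})\right)$ for every constant $c \in (0, \infty)$. Specifically, we will show that there is a distribution $\mathcal{P}_{XY}$ for which $\Lambda_p(\nu + \varepsilon, \mathcal{P}_{XY}) = \Theta(1/\varepsilon)$ and $\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$.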

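Let $\mathcal{P}(\{0\}) = 1/2$, and for any measurable $A \subseteq (0, 1]$, $\mathcal{P}(A) = \lambda(A)/2$, where $\lambda$ is Lebesgue measure. Let $\mathbb{D}$ be the family of distributions $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$ characterized by the properties that the marginal distribution on $\mathcal{X}$ is $\mathcal{P}$, $\eta(0; \mathcal{P}_{XY}) \in (1/8, 3/8)$, and $\forall x \in (0, 1]$,

$$\eta(x; \mathcal{P}_{XY}) = \eta(0; \mathcal{P}_{XY}) + (x/2) \cdot (1 - \eta(0; \mathcal{P}_{XY})).$$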

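Thus, $\eta(x; \mathcal{P}_{XY})$ is a linear function. For any $\mathcal{P}_{XY} \in \mathbb{D}$, since the point $z^* = \frac{1 - 2\eta(0; \mathcal{P}_{XY})}{1 - \eta(0; \mathcal{P}_{XY})}$ has $\eta(z^*; \mathcal{P}_{XY}) = 1/2$, we see that $f = h_{z^*}$ is a Bayes optimal classifier. Furthermore, for any $\eta_0 \in [1/8, 3/8]$,

$$\left| \frac{1 - 2\eta_0}{1 - \eta_0} - \frac{1 - 2\eta(0; \mathcal{P}_{XY})}{1 - \eta(0; \mathcal{P}_{XY})} \right| = \frac{|\eta(0; \mathcal{P}_{XY}) - \eta_0|}{(1 - \eta_0)(1 - \eta(0; \mathcal{P}_{XY}))},$$

and since $(1 - \eta_0)(1 - \eta(0; \mathcal{P}_{XY})) \in (25/64, 49/64) \subset (1/3, 1)$, the value $z = \frac{1 - 2\eta_0}{1 - \eta_0}$ satisfies

$$|\eta_0 - \eta(0; \mathcal{P}_{XY})| \le |z - z^*| \le 3\,|\eta_0 - \eta(0; \mathcal{P}_{XY})|. \tag{79}$$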
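Inequality (79) can be checked numerically. The following is a minimal sketch, not part of the original argument; the function names `threshold` and `check_inequality_79` are hypothetical, and the grid over $[1/8, 3/8]$ is an arbitrary choice made for illustration.

```python
def threshold(eta0: float) -> float:
    """Map the noise level at x = 0 to the point z where eta(z) = 1/2."""
    return (1.0 - 2.0 * eta0) / (1.0 - eta0)


def eta(x: float, eta0: float) -> float:
    """Linear regression function eta(x) = eta0 + (x/2)(1 - eta0)."""
    return eta0 + (x / 2.0) * (1.0 - eta0)


def check_inequality_79(eta_hat: float, eta_true: float) -> bool:
    """True if |eta_hat - eta_true| <= |z - z*| <= 3 |eta_hat - eta_true|."""
    gap = abs(eta_hat - eta_true)
    dist = abs(threshold(eta_hat) - threshold(eta_true))
    tol = 1e-12  # slack for floating-point rounding only
    return gap - tol <= dist <= 3.0 * gap + tol


# Hypothetical grid of 51 evenly spaced noise levels in [1/8, 3/8].
grid = [1 / 8 + k * (1 / 4) / 50 for k in range(51)]
all_ok = all(check_inequality_79(a, b) for a in grid for b in grid)
```

The check passes on the whole grid because the denominator $(1-\eta_0)(1-\eta(0;\mathcal{P}_{XY}))$ stays inside $(1/3, 1)$ on this range, which is exactly the fact the text uses.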
Also note that under VxY, since (1 - 2r/(0; Vxy)) = (1 - ??(0; Vxy))z*, any z G (0, 1) has 



ei{hz) - er(/i^ 



so that 



{l-2r]{x]VxY))^x 
(l-r?(0;Pxy)) 



x) dx 



^{z - z*f < er(/i,) - er(/i,.) < -^(z - z*)''. 
lb io 



1 - 2r]{0;VxY) - x{l - r]{0;VxY)))dx 
[1 - viO;VxY)) 



(z* - zY 



(80) 



Finally, note that any x, x' G (0, 1] with |x — z*| < \x' — z* \ has 

\1-2i^{x-Vxy)\ = \x-z*\{l-ri{fd;VxY)) < \x' - z*\{l - 7]{0-Vxy)) = \1 - 27]{x';Vxy)\. 
Thus, for any q G (0, 1/2], there exists z'^ G [0, 1] such that z* G [z'g, z'^ + 2q] C [0, 1], and the clas- 



sifier h'q{x) = hz* {x) ■ yl — 21(2^^^^_|_2g](x) j has er(/i) > er(/ig) for every classifier h with /i(0) = 
— 1 andP(x : h{x) / hz*{x)) = q. Noting that er(/ig) — er{hz*) = ( lim^^^/ er(/i2) — er{hz*)] + 



er(/i 



er(/i2 



implies that er(/i' ) — er(/i^* ) > 



16 

and since maxjz* — z^, z'^ + 2q — z*} > q, this is at least ^q^- In general, any h with ^(0) = +1 
has er(/i) - er{hz*) > 1/2 - r?(0;Pxy) > 1/8 > {l/8)V{x : h{x) / hz*{x)f. Combining these 
facts, we see that any classifier h has 



*) +(4 + 29 



er(/i) - exihz") > {l/S)V {x : h{x) / hz-{x)f . 



(81) 



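Lemma 54 The passive learning algorithm $\mathcal{A}_p$ achieves a label complexity $\Lambda_p$ such that, for every $\mathcal{P}_{XY} \in \mathbb{D}$, $\Lambda_p(\nu + \varepsilon, \mathcal{P}_{XY}) = \Theta(1/\varepsilon)$. $\diamond$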

Proof Consider the values 770 and z from Ap{Zn) for some n G N. Combining (1791 ) and (l80l ). 
we have er(/i5) - ei{hz*) < ^{z - z*f < §(770 - r]{0;VxY)f < ^Vo - v{0;Vxy)?. Let 
Nn = \{i G {l,...,n} : Xi = 0}|, and f/o = G : X, = 0,Yi = +1}| if 

A^n > 0, or ??o = if Nn = 0. Note that 770 = (??o V |) A |, and since 7/(0; Pxy) G (1/8, 3/8), we 
have |i7o — r/(0;7'xr)| < |^o — ^(0;'Pxy)|- Therefore, for any Vxy G ID), 

E[er(/i5) - er{hz*)] < 4E [(r}o - viO;VxY)f] < 4E [(% - viO;VxY)f] 

< 4E [e [(f/o - viO;VxY)f\Nn\ l[„,/4,n](A^n)] + 4P(iV„ < n/4). (82) 

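By a Chernoff bound, $\mathbb{P}(N_n < n/4) \le \exp\{-n/16\}$, and since the conditional distribution of $N_n \bar{\eta}_0$ given $N_n$ is Binomial$(N_n, \eta(0; \mathcal{P}_{XY}))$, (82) is at most

$$4\,\mathbb{E}\left[ \frac{1}{N_n \vee n/4}\, \eta(0; \mathcal{P}_{XY})(1 - \eta(0; \mathcal{P}_{XY})) \right] + 4 \cdot \exp\{-n/16\} \le 4 \cdot \frac{4}{n} \cdot \frac{15}{64} + 4 \cdot \frac{16}{n} < \frac{68}{n}.$$

For any $n \ge \lceil 68/\varepsilon \rceil$, this is at most $\varepsilon$. Therefore, $\mathcal{A}_p$ achieves a label complexity $\Lambda_p$ such that, for any $\mathcal{P}_{XY} \in \mathbb{D}$, $\Lambda_p(\nu + \varepsilon, \mathcal{P}_{XY}) = \lceil 68/\varepsilon \rceil = \Theta(1/\varepsilon)$. ■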
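The arithmetic behind the $68/n$ bound can be verified directly. This is an illustrative sketch under the assumption that the two terms are the variance term $4 \cdot (4/n) \cdot (15/64)$ and the tail term $4\exp\{-n/16\}$, as in the display above; the function names are hypothetical.

```python
import math


def excess_risk_bound(n: int) -> float:
    """Variance term plus Chernoff tail term from the bound on (82)."""
    variance_term = 4.0 * (4.0 / n) * (15.0 / 64.0)  # eta(1 - eta) <= 15/64
    tail_term = 4.0 * math.exp(-n / 16.0)            # 4 * P(N_n < n/4)
    return variance_term + tail_term


def labels_sufficient(eps: float) -> int:
    """Sample size ceil(68 / eps) used in the proof of Lemma 54."""
    return math.ceil(68.0 / eps)


# The bound stays below 68/n for every sample size in this range.
bound_holds = all(excess_risk_bound(n) <= 68.0 / n for n in range(1, 5000))
```

The tail term is controlled using $e^{-x} \le 1/x$, which gives $4\exp\{-n/16\} \le 64/n$; adding $15/(4n)$ from the variance term leaves a total below $68/n$.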
Next we establish a corresponding lower bound for any active learning algorithm. Note that this requires more than a simple minimax lower bound, since we must have an asymptotic lower bound for a fixed $\mathcal{P}_{XY}$, rather than selecting a different $\mathcal{P}_{XY}$ for each $\varepsilon$ value; this is akin to the strong minimax lower bounds proven by Antos and Lugosi (1998) for passive learning in the realizable case. For this, we proceed by reduction from the task of estimating a binomial mean; toward this end, the following lemma will be useful.

Lemma 55 For any nonempty $(a, b) \subseteq [0, 1]$, and any sequence of estimators $\hat{p}_n : \{0, 1\}^n \to [0, 1]$, there exists $p \in (a, b)$ such that, if $B_1, B_2, \ldots$ are independent Bernoulli$(p)$ random variables, also independent from every $\hat{p}_n$, then $\mathbb{E}\left[ (\hat{p}_n(B_1, \ldots, B_n) - p)^2 \right] \neq o(1/n)$.



Proof We first establish the claim when $a = 0$ and $b = 1$. For any $p \in [0, 1]$, let $B_1(p), B_2(p), \ldots$ be i.i.d. Bernoulli$(p)$ random variables, independent from any internal randomness of the $\hat{p}_n$ estimators. We proceed by reduction from hypothesis testing, for which there are known lower bounds. Specifically, it is known (e.g., Wald, 1945; Bar-Yossef, 2003) that for any $p, q \in (0, 1)$, $\delta \in (0, e^{-1})$, any (possibly randomized) $\hat{q} : \{0, 1\}^n \to \{p, q\}$, and any $n \in \mathbb{N}$,

$$n < \frac{(1 - 8\delta)\ln(1/8\delta)}{8\,\mathrm{KL}(p\|q)} \implies \max_{p^* \in \{p, q\}} \mathbb{P}\left( \hat{q}(B_1(p^*), \ldots, B_n(p^*)) \neq p^* \right) > \delta,$$

where $\mathrm{KL}(p\|q) = p \ln(p/q) + (1 - p)\ln((1 - p)/(1 - q))$. It is also known (e.g., Poland and Hutter, 2006) that for $p, q \in [1/4, 3/4]$, $\mathrm{KL}(p\|q) \le (8/3)(p - q)^2$. Combining this with the above fact, we have that for $p, q \in [1/4, 3/4]$,

$$\max_{p^* \in \{p, q\}} \mathbb{P}\left( \hat{q}(B_1(p^*), \ldots, B_n(p^*)) \neq p^* \right) \ge (1/16) \cdot \exp\{-128 (p - q)^2 n / 3\}. \tag{83}$$

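Given the estimator $\hat{p}_n$ from the lemma statement, we construct a sequence of hypothesis tests as follows. For $i \in \mathbb{N}$, let $\alpha_i = \exp\{-2^i\}$ and $n_i = \lfloor 1/\alpha_i^2 \rfloor$. Define $p_0^* = 1/4$, and for $i \in \mathbb{N}$, inductively define $\hat{q}_i(b_1, \ldots, b_{n_i}) = \operatorname{argmin}_{p \in \{p_{i-1}^*,\, p_{i-1}^* + \alpha_i\}} |\hat{p}_{n_i}(b_1, \ldots, b_{n_i}) - p|$ for $b_1, \ldots, b_{n_i} \in \{0, 1\}$, and $p_i^* = \operatorname{argmax}_{p \in \{p_{i-1}^*,\, p_{i-1}^* + \alpha_i\}} \mathbb{P}\left( \hat{q}_i(B_1(p), \ldots, B_{n_i}(p)) \neq p \right)$. Finally, define $p^* = \lim_{i \to \infty} p_i^*$. Note that $\forall i \in \mathbb{N}$, $p_i^* < 1/2$, $p_{i-1}^*, p_{i-1}^* + \alpha_i \in [1/4, 3/4]$, and $0 \le p^* - p_i^* \le \sum_{j > i} \alpha_j < 2\alpha_{i+1} = 2\alpha_i^2$. We generally have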



E 



(p.,(i?i(p*),...,i?.,(p*))-p*r 



> -E 

- 3 

> -E 

- 3 



{Pn^B^ip* 
(PnABlip* 



.BnAP*))-pl 

,BnXp*))-P* 



{p*-piy 



Furthermore, note that for any m G {0, . . . , Ui}, 



(p*)"(l -p*)"« 
(p*)-(l-p*)-« 



> 



1 -p* 
1^ 



> 



Pi 



2ai 



1-p* 



> (1 - 4a,2)"' > exp{-8Qfn,} > e" 



so that the probability mass function of (i?i(p*), . . . , Bn^{p*)) is never smaller than e ^ times that 
of (i?i(p*), . . . , Bmip*)), which imphes (by the law of the unconscious statistician) 



E 



{PnABl(j}*),...,BnAp*))-p*y 



> e""E 



(Pn, {Bi{p* ),..., Bn, (p ■ ))-p* 
By a triangle inequality, we have 

E 

By ([83]), this is at least 



2 

^(1/16) • exp{-128a2ni/3} > 2-^e-^^a\. 



Combining the above, we have 



E 



*\2 



For i > 5, this is larger than 2 ^^e ^^n,- ^. Since diverges as i — )■ oo, we have that 



E 



{VnSBiipl, . ■ .,Br,Sp*)) -P*f / o(l/n 



which establishes the result for $a = 0$ and $b = 1$.
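The sequences used in this construction decay doubly exponentially, which is what makes the tail bound $\sum_{j>i} \alpha_j < 2\alpha_{i+1}$ work. The sketch below only illustrates those numerical facts, assuming $\alpha_i = \exp\{-2^i\}$ and $n_i = \lfloor 1/\alpha_i^2 \rfloor$ as in the proof; the helper names are hypothetical, and the tail sum is truncated to a few terms because later terms underflow to zero in floating point.

```python
import math


def alpha(i: int) -> float:
    """Gap between the two hypotheses in the i-th test: exp(-2^i)."""
    return math.exp(-(2 ** i))


def n_tests(i: int) -> int:
    """Sample size of the i-th test, floor(1 / alpha_i^2) = floor(exp(2^(i+1))).

    Only meaningful for small i (i <= 8) before exp overflows a float.
    """
    return math.floor(math.exp(2 ** (i + 1)))


def tail_sum(i: int, terms: int = 8) -> float:
    """Truncated sum of alpha_j over j > i (remaining terms are negligible)."""
    return sum(alpha(j) for j in range(i + 1, i + 1 + terms))


# The limit p* is at most 1/4 plus the sum of all gaps, which stays below 1/2.
p_star_upper = 0.25 + tail_sum(0)
```

Since $\alpha_{i+1} = \alpha_i^2$, each test uses about $1/\alpha_i^2$ samples to separate hypotheses that are $\alpha_i$ apart, which keeps the exponent in (83) bounded by a constant.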

To extend this result to general nonempty ranges $(a, b)$, we proceed by reduction from the above problem. Specifically, suppose $p' \in (0, 1)$, and consider the following independent random variables (also independent from the $B_i(p')$ variables and $\hat{p}_n$ estimators). For each $i \in \mathbb{N}$, $C_{i1} \sim$ Bernoulli$(a)$, $C_{i2} \sim$ Bernoulli$((b - a)/(1 - a))$. Then for $b_i \in \{0, 1\}$, define $B_i'(b_i) = \max\{C_{i1}, C_{i2} \cdot b_i\}$. For any given $p' \in (0, 1)$, the random variables $B_i'(B_i(p'))$ are i.i.d. Bernoulli$(p)$, with $p = a + (b - a)p' \in (a, b)$ (which forms a bijection between $(0, 1)$ and $(a, b)$). Defining $\hat{p}_n'(b_1, \ldots, b_n) = (\hat{p}_n(B_1'(b_1), \ldots, B_n'(b_n)) - a)/(b - a)$, we have

$$\mathbb{E}\left[ (\hat{p}_n(B_1(p), \ldots, B_n(p)) - p)^2 \right] = (b - a)^2 \cdot \mathbb{E}\left[ (\hat{p}_n'(B_1(p'), \ldots, B_n(p')) - p')^2 \right]. \tag{84}$$



We have already shown there exists a value of $p' \in (0, 1)$ such that the right side of (84) is not $o(1/n)$. Therefore, the corresponding value of $p = a + (b - a)p' \in (a, b)$ has the left side of (84) not $o(1/n)$, which establishes the result.
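The rescaling step relies on $\max\{C_{i1}, C_{i2} \cdot B_i(p')\}$ being Bernoulli with mean $a + (b - a)p'$. The following sketch, with hypothetical function names and an arbitrary seed, computes that mean exactly and also draws samples; it assumes $a < 1$ so that $(b-a)/(1-a)$ is a valid probability.

```python
import random


def transformed_mean(a: float, b: float, p_prime: float) -> float:
    """Exact P(max{C1, C2 * B} = 1) with C1~Bern(a), C2~Bern((b-a)/(1-a)), B~Bern(p')."""
    p_c2 = (b - a) / (1.0 - a)
    # The maximum is 0 only if C1 = 0 and C2 * B = 0.
    return 1.0 - (1.0 - a) * (1.0 - p_c2 * p_prime)


def sample_transformed(a: float, b: float, bit: int, rng: random.Random) -> int:
    """Draw one B'(bit) = max{C1, C2 * bit} using the supplied generator."""
    c1 = 1 if rng.random() < a else 0
    c2 = 1 if rng.random() < (b - a) / (1.0 - a) else 0
    return max(c1, c2 * bit)


# Monte Carlo estimate for a = 0.2, b = 0.7, p' = 0.4 (illustrative values).
rng = random.Random(0)
draws = [
    sample_transformed(0.2, 0.7, 1 if rng.random() < 0.4 else 0, rng)
    for _ in range(20000)
]
empirical_mean = sum(draws) / len(draws)
```

For these values the exact mean is $0.2 + 0.5 \cdot 0.4 = 0.4$, matching $p = a + (b - a)p'$.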



We are now ready for the lower bound result for our setting. 

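Lemma 56 For any label complexity $\Lambda_a$ achieved by any active learning algorithm $\mathcal{A}_a$, there exists a $\mathcal{P}_{XY} \in \mathbb{D}$ such that $\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$. $\diamond$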

Proof The idea here is to reduce from the task of estimating the mean of i.i.d. Bernoulli trials, corresponding to the $Y_i$ values. Specifically, consider any active learning algorithm $\mathcal{A}_a$; we use $\mathcal{A}_a$ to construct an estimator for the mean of i.i.d. Bernoulli trials as follows. Suppose we have $B_1, B_2, \ldots, B_n$ i.i.d. Bernoulli$(p)$, for some $p \in (1/8, 3/8)$ and $n \in \mathbb{N}$. We take the sequence of $X_1, X_2, \ldots$ random variables i.i.d. with distribution $\mathcal{P}$ defined above (independent from the $B_j$ variables). For each $i$, we additionally have a random variable $C_i$ with conditional distribution Bernoulli$(X_i/2)$ given $X_i$, where the $C_i$ are conditionally independent given the $X_i$ sequence, and independent from the $B_i$ sequence as well.
We run $\mathcal{A}_a$ with this sequence of $X_i$ values. For the $t^{\text{th}}$ label request made by the algorithm, say for the $Y_i$ value corresponding to some $X_i$, if it has previously requested this $Y_i$ already, then we simply repeat the same answer for $Y_i$ again, and otherwise we return to the algorithm the value $2\max\{B_t, C_i\} - 1$ for $Y_i$. Note that in the latter case, the conditional distribution of $\max\{B_t, C_i\}$ is Bernoulli$(p + (1 - p)X_i/2)$, given the $X_i$ that $\mathcal{A}_a$ requests the label of; thus, the $Y_i$ response has the same conditional distribution given $X_i$ as it would have for the $\mathcal{P}_{XY} \in \mathbb{D}$ with $\eta(0; \mathcal{P}_{XY}) = p$ (i.e., $\eta(X_i; \mathcal{P}_{XY}) = p + (1 - p)X_i/2$). Since this $Y_i$ value is conditionally (given $X_i$) independent from the previously returned labels and $X_j$ sequence, this is distributionally equivalent to running $\mathcal{A}_a$ under the $\mathcal{P}_{XY} \in \mathbb{D}$ with $\eta(0; \mathcal{P}_{XY}) = p$.
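The label-answering scheme in this reduction can be sketched as a small oracle that consumes one fresh Bernoulli trial per new request and caches answers for repeats. This is an illustration only: the class name, constructor arguments, and seed are hypothetical, and the auxiliary variable $C_i$ is drawn lazily at the first request for index $i$ rather than in advance, which is assumed here to be distributionally equivalent.

```python
import random
from typing import Dict, Iterable, Sequence


class SimulatedLabelOracle:
    """Answers label requests with 2 * max{B_t, C_i} - 1, repeating cached answers."""

    def __init__(self, bernoulli_bits: Iterable[int], xs: Sequence[float], seed: int = 0):
        self._bits = iter(bernoulli_bits)   # B_1, B_2, ... consumed one per new request
        self._xs = xs                       # unlabeled points X_1, X_2, ... in [0, 1]
        self._rng = random.Random(seed)     # randomness for the C_i variables
        self._answers: Dict[int, int] = {}  # previously returned labels

    def request(self, i: int) -> int:
        """Return the label for index i in {-1, +1}."""
        if i in self._answers:
            return self._answers[i]         # repeated request: same answer, no new trial
        b = next(self._bits)
        c = 1 if self._rng.random() < self._xs[i] / 2.0 else 0  # C_i ~ Bern(X_i / 2)
        y = 2 * max(b, c) - 1
        self._answers[i] = y
        return y


def eta(x: float, p: float) -> float:
    """Conditional probability of label +1 at x: p + (1 - p) x / 2."""
    return p + (1.0 - p) * x / 2.0


oracle = SimulatedLabelOracle([1, 0, 0], xs=[0.0, 0.0, 1.0])
first = oracle.request(0)
repeat = oracle.request(0)
second = oracle.request(1)
third = oracle.request(2)
```

Caching matters because the number of Bernoulli trials consumed must never exceed the number of distinct label requests, which is at most the budget $n$.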

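Let $\hat{h}_n$ be the classifier returned by $\mathcal{A}_a(n)$ in the above context, and let $\hat{z}_n$ denote the value of $z \in [2/5, 6/7]$ with minimum $\mathcal{P}(x : h_z(x) \neq \hat{h}_n(x))$. Then define $\hat{p}_n = \frac{1 - \hat{z}_n}{2 - \hat{z}_n} \in [1/8, 3/8]$ and $z^* = \frac{1 - 2p}{1 - p} \in (2/5, 6/7)$. By a triangle inequality, we have $|\hat{z}_n - z^*| = 2\mathcal{P}(x : h_{\hat{z}_n}(x) \neq h_{z^*}(x)) \le 4\mathcal{P}(x : \hat{h}_n(x) \neq h_{z^*}(x))$. Combining this with (81) and (79) implies that

$$\operatorname{er}(\hat{h}_n) - \operatorname{er}(h_{z^*}) \ge \frac{1}{8}\left( \mathcal{P}\left( x : \hat{h}_n(x) \neq h_{z^*}(x) \right) \right)^2 \ge \frac{1}{128}(\hat{z}_n - z^*)^2 \ge \frac{1}{128}(\hat{p}_n - p)^2. \tag{85}$$

In particular, by Lemma 55, we can choose $p \in (1/8, 3/8)$ so that $\mathbb{E}\left[ (\hat{p}_n - p)^2 \right] \neq o(1/n)$, which, by (85), implies $\mathbb{E}\left[ \operatorname{er}(\hat{h}_n) \right] - \nu \neq o(1/n)$. This means there is an increasing infinite sequence of values $n_k \in \mathbb{N}$, and a constant $c \in (0, \infty)$ such that $\forall k \in \mathbb{N}$, $\mathbb{E}\left[ \operatorname{er}(\hat{h}_{n_k}) \right] - \nu \ge c/n_k$. Supposing $\mathcal{A}_a$ achieves label complexity $\Lambda_a$, and taking the values $\varepsilon_k = c/(2n_k)$, we have $\Lambda_a(\nu + \varepsilon_k, \mathcal{P}_{XY}) > n_k = c/(2\varepsilon_k)$. Since $\varepsilon_k > 0$ and approaches $0$ as $k \to \infty$, we have $\Lambda_a(\nu + \varepsilon, \mathcal{P}_{XY}) \neq o(1/\varepsilon)$. ■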
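The conversion from a threshold estimate back to a Bernoulli-mean estimate can be illustrated as follows. This sketch assumes the estimate is obtained by inverting the map $p \mapsto (1 - 2p)/(1 - p)$, which is consistent with the stated ranges $[2/5, 6/7]$ and $[1/8, 3/8]$ but is an assumption rather than a quotation of the source; the function names are hypothetical.

```python
def threshold_from_noise(p: float) -> float:
    """Bayes threshold z* = (1 - 2p) / (1 - p) for noise level p at x = 0."""
    return (1.0 - 2.0 * p) / (1.0 - p)


def noise_from_threshold(z: float) -> float:
    """Inverse map p = (1 - z) / (2 - z), assumed to define the estimator."""
    return (1.0 - z) / (2.0 - z)


# Constant in the chain of inequalities: (1/8) * (|z_hat - z*| / 4)^2
# equals (1/128) * (z_hat - z*)^2.
chain_constant = (1.0 / 8.0) * (1.0 / 4.0) ** 2
```

The endpoints map as expected: $z = 2/5$ corresponds to $p = 3/8$ and $z = 6/7$ corresponds to $p = 1/8$, so restricting $\hat{z}_n$ to $[2/5, 6/7]$ keeps the induced estimate inside the range where (79) applies.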



Proof [of Theorem 22] The result follows from Lemmas 54 and 56. ■



E.2 Proof of Lemma 26: Label Complexity of Algorithm 5

The proof of Lemma 26 essentially runs parallel to that of Theorem 16, with variants of each lemma from that proof adapted to the noise-robust Algorithm 5.

As before, in this section we will fix a particular joint distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1, +1\}$ with marginal $\mathcal{P}$ on $\mathcal{X}$, and then analyze the label complexity achieved by Algorithm 5 for that particular distribution. For our purposes, we will suppose $\mathcal{P}_{XY}$ satisfies Condition 1 for some finite parameters $\mu$ and $\kappa$. We also fix any $f \in \bigcap_{\varepsilon > 0} \operatorname{cl}(\mathbb{C}(\varepsilon))$. Furthermore, we will continue using the notation of Appendix B, such as $\mathcal{S}^k(\mathcal{H})$, etc., and in particular we continue to denote $V_m^* = \{h \in \mathbb{C} : \forall \ell \le m, h(X_\ell) = f(X_\ell)\}$ (though note that in this case, we may sometimes have $f(X_\ell) \neq Y_\ell$, so that $V_m^* \neq \mathbb{C}[\mathcal{Z}_m]$). As in the above proofs, we will prove a slightly more general result in which the "1/2" threshold in Step 5 can be replaced by an arbitrary constant $\gamma \in (0, 1)$.

For the estimators $\hat{P}_m$ used in the algorithm, we take the same definitions as in Appendix B.1. To be clear, we assume the sequences $W_1$ and $W_2$ mentioned there are independent from the entire $(X_1, Y_1), (X_2, Y_2), \ldots$ sequence of data points; this is consistent with the earlier discussion of how these $W_1$ and $W_2$ sequences can be constructed in a preprocessing step.
We will consider running Algorithm 5 with label budget n G N and confidence parameter 

5 G (0, e~^), and analyze properties of the internal sets Vi. We will denote by Vi, Ci, and ij^, the 
final values of Vi, Ci, and i^, respectively, for each i and k in Algorithm 5. We also denote by fn^^^ 
and V^^'' the final values of m and Vij,+i, respectively, obtained while k has the specified value in 
Algorithm 5; V^''^ may be smaller than V-^ when mP^'> is not a power of 2. Additionally, define 

= {(^m> ^m) 1^=2^-1+1 • -^ftei' establishing a few results concerning these, we will show that 
for $n$ satisfying the condition in Lemma 26, the conclusion of the lemma holds. First, we have a few auxiliary definitions. For $\mathcal{H} \subseteq \mathbb{C}$, and any $i \in \mathbb{N}$, define

(pi{n) = E sup |(er(/ii) -er£*(/ii)) - (er(/i2) -er£*(/i2))| 



and U.i-H, 6) = min { K \ WH) + ^/ diam(^)i^^i|^ + ^^^^ 



where for our purposes we c an take K = 8272. It is known (see e.g., Massart and Nedelec , 20061: 



Gine and Koltchinskii . 20061) that for some universal constant d G [2, oo). 



t+i{n) < c'max |ydiam(^)2-*(ilog2 ^j— ^^''^^j • (86) 
We also generally have $\phi_i(\mathcal{H}) \le 2$ for every $i \in \mathbb{N}$. The next lemma is taken from the work of Koltchinskii (2006) on data-dependent Rademacher complexity bounds on the excess risk.



Lemma 57 For any 6 £ (0, e '^), any ?^ C C with f € cl{T-l), and any i £ N, on an event Ki with 

^{Ki) > 1 - 5/4*^' v/i G n, 

eTc*{h) - mill eTc*{h') < er(/i) - er(/) + UiCH, 6) 

h 

er(/i) - er(/) < erc*{h) - erc*{f) + Ui{n,5) 
min|j7,(?^,(5),l} < o 

Lemma 57 essentially follows from a version of Talagrand's inequality. The details of the proof may be extracted from the proofs of Koltchinskii (2006), and related derivations have previously been presented by Hanneke (2011); Koltchinskii (2010). The only minor twist here is that $f$ need only be in $\operatorname{cl}(\mathcal{H})$, rather than in $\mathcal{H}$ itself, which easily follows from Koltchinskii's original results, since the Borel-Cantelli lemma implies that with probability one, every $\varepsilon > 0$ has some $g \in \mathcal{H}(\varepsilon)$ (very close to $f$) with $\operatorname{er}_{\mathcal{L}_i^*}(g) = \operatorname{er}_{\mathcal{L}_i^*}(f)$.

For our purposes, the important implications of Lemma 57 are summarized by the following lemma.

Lemma 58 For any 5 € (0, e~^) and any n E N, when running Algorithm 5 with label budget n 
and confidence parameter 5, on an event J„(5) with P(J„((5)) > 1 — 6/2, Vi € {0, 1, . . . , id+i}, if 
V* C Vi then V/i G 

er^* {h) - min er^* {h') < ei{h) - er(/) + f/^+i 6) (87) 
+ h'eVi + 

er(/i) - er(/) < er^.^^(/i) - eTc*^^{f) + Ui+i{Vi, 6) (88) 
min {Ui+i{Vi,6),l} <Ui+i{Vi,6). (89) 






Proof For each i, consider applying LemmalST] under the conditional distribution given Vi. The set 
is independent from Vi, as are the Rademacher variables in the definition of i?j+i(Vi). Further- 
more, by Lemma[35l on H', / G cl (VJ) , so that the conditions of Lemma[57]hold. The law of total 
probability then implies the existence of an event Jj of probability P( Jj) > 1 — 5/4(z+l)^, on which 
the claimed inequalities hold for that value of i if i < id+i- A union bound over values of i then 
implies the existence of an event Jn(5) = fit with probability P( J„(5)) > 1 — 5/4(i + 1)^ > 
1 — (5/2 on which the claimed inequalities hold for all i < id+i- ■ 



Lemma 59 For some {C,VxY,l)-dependent constants c,c* G [I, oo), for any 5 G (0,e ^) and 
integer n > c* ln(l/ 6), when running Algorithm 5 with label budget n and confidence parameter 5, 
on event Jn{S) n Hn^ n Hn^\ every z G {0, 1, . . . , } satisfies 



f- 

di + ln{l/6) \' 
2' / 



and furthermore V* ,^ ^ yi'^f). 

rh f' 



Proof Definec= (24Kc'^y-\c* = max |r*, 8d (,^) ^''"Mog^ (,^) }, and 

suppose n > c* ln(l/(5). We now proceed by induction. As the right side equals C for i = 0, the 
claimed inclusions are certainly true for Vq = C, which serves as our base case. Now suppose some 
i G {0, 1, . . . , ij^} satisfies 

V,cv.cc(c('^±m^]^']. (90) 



2» 



In particular, Condition [T] implies 



If i < ij^, then let k be the integer for which i^^i < i < ik, and otherwise let k = df. Note that 
we certainly have ii > [Iog2(n/2)J, since m = [n/2\ > 2l-'°S2("/2)J obtained while k = 1. 
Therefore, if A; > 1, 

di + ln{l/6) 4(ilog2(n) + 41n(l/(5) 
2' - n ' 

so that ( |9T] ) implies 

/.-A 1 Mdlogofn) + 41n(lM)\ 2^ 
diam [Vij < fic^ ( ^ ) 

By our choice of c*, the right side is at most r(x_^)/6- Therefore, since Lemma [35] implies / G 
we have Vi C B (/, r(i„^)/g) when A; > 1. Combined with (|90l), we have that 
V^i ^ Vi, and either /c = 1, or C B(/, r(i„^)/g) and Am > 4[n/2j > n. Now consider any m 
with 2* + 1 < m < min|2*+\m('^/)|, and for the purpose of induction suppose ^ T/j^^ 

upon reaching Step 5 for that value of m in Algorithm 5. Since ViJ^i C and n > r*, Lemma |4T] 
(with £ = m - 1) implies that on h'^P n H^''\ 

(92) 

so that after Step 8 we have C Vi+i. Since (|90l ) implies that the V*^_i C Vi^i condition holds if 
Algorithm 5 reaches Step 5 with m = 2* + 1 (at which time Fj+i = Vi), we have by induction that 
on Hn^ n Hn^\ V*^ C Vi+i upon reaching Step 9 with m = min |2*+^, m^'^-') |. This establishes 
the final claim of the lemma, given that the first claim holds. For the remainder of this inductive 
proof, suppose i < ig^. Since Step 8 enforces that, upon reaching Step 9 with m = 2*+^, every 

hi, /i2 G Vi+i have er^^^^(/ii) - er^^^^(/i2) = er£*^^(/ii) - er£^^^(/i2), on J„(5) n hI^^ n i/^"-" 
we have 



Vi+i C <( /i G T/j : er£* (/i) - min er£* {h') < Ui+i [Vi,5 
C {/I G V- : er^.^^(/i) - er^.^^C/) < U,+i (k„ 5 



c nc (^2?7,+i [y^.^)) Q c (^2[/,+i j , (93) 

where the second line follows from Lemma [35] and the last two inclusions follow from Lemma [58l 
Focusing on ( |93l ). combining ( |9T] ) with (l86l ) (and the fact that (/)j_|_i(Vi) < 2), we can bound 

Ui+i (Vi,S^ as follows. 



jL /2(ii + 21n(l/(5)\^ /8(i + 1) + 21n(l/5)\ ^ 



2^+1 / I 2*+i 

^ d (i + 1) +ln(l/J ) ^ 
2^+1 



and thus 



1 
Combining this with ( 1931 ) now implies 

' d{i + 1) + ln{l/ 6) 



Vi+i C 



2i+i 



To complete the inductive proof, it remains only to show V*i+i C Vi+i. Toward this end, recall 

we have shown above that on Hn^ n Hn^\ V*^^i C V^+i upon reaching Step 9 with m = 2*"*"^, and 
that every hi,h2 € Vi+i at this point have er^^^(/ii) — er^ _^^(/i2) = er£*^^(/ii) — er£*^_^ (/12). 
Consider any h G VJ+i, and note that any other g G V*,+i has er£*^^(5() = er£*^^(/i). Thus, on 

er/; f/i) — min ei r (h') = err* (h) — min err* (/i') 

< er£* (h) — min er^* (h') = inf er^* (g) — min er^* (/i'). (94) 

Lemma [58] and ( |90l ) imply that on Jn(5) H //n^ n -ffn*\ the last expression in (|94l ) is at most 
infggy^*^^ er((7) — er(/) + Ui+i{Vi,6), and Lemma |35] implies / G cl(V2*+i) on Hn \ so that 
infggy^*^^ er((7) = er(/). We therefore have 

er,; (h) — min er,; (h') < Ui+iiV:, 5), 

so that h G V^+i as well. Since this holds for any h G V^t+i, we have V^,+i C Fj+i. The lemma 
now follows by the principle of induction. ■ 



Lemma 60 There exist (C, VxY ,l)-dependent constants c{^C2 G [1, 00) such that, for any e,6 G 
(0, e~^) and integer 

when running Algorithm 5 with label budget n and confidence parameter 5, on an event JjJ(e, S) 
with P(J*(e, S))>1- 5, we have y.. C C(e). o 



Proof Define 



Ci = max < 2 



df+5 f 



i/ft 



(l-7)/6 



dlog. 



d^cV'' 2 



' ^(l-7)/6 ' 



lnf8c»),^ln(8c(^^) 



and 



C2 = max < 



1/k 



2k-1 



/{l-7)/6 



,2'^'/+^5.^1ogi(4dc) 



Fix any e, (5 G (0, e~^) and integer n > c*^ + clfif (e« ) ^ logi (^). 
For each i G {0, 1, . . .}, let fj = fic 



^^^^^±M1MV^~\ Also define 



1\ c 

2 log2 - + log2 

K J £ 



2dc 



and let i = min {i G N : supj>j fj < r^i.^j/g}. For any i G |i, . . . , ij^ |, let 

Q,+i = |m G {2* + 1, . . . , 2^+^} : Af^^ (X^, 1^2, B (/, f,)) > 27/3 



Also define 



^ 96 - / i\ 9 / 2fic\ 2 



-f6f 



By Lemma[53and Condition [B on J„((5) n H^^^ n F^**'', if i< ia , 



(95) 



Lemma[59lalso implies that, on Jn{S) H Hn^ n for i with < ^ < ^^^> all of the sets T/j+i 

obtained in Algorithm 5 while k = df and m G {2* + 1, . . . , 2*+^} satisfy VgVi ^ ^i+i ^ ^■ 
Recall that zi > [log2(n/2)J, so that we have either = 1 or else every m G {2* + 1, . . . , 2*+^} 
has 4m > n. Also recall that Lemma |49] implies that when the above conditions are satisfied, and 



i > 5, on H'nc'iK Ai^^ W2, Fi+i) < iS/2)A'Z' (^m,, W^2, B (/, n)), so that |Q,+i| upper 
bounds the number of m G {2* + 1, . . . , 2*+^} for which Algorithm 5 requests the label Ym in Step 

6 of the A; = J/ round. Thus, on J„(5) n n Hi^''\ 2* + ^ 



i=max 



I Qj+i I upper bounds 



the total number of label requests by Algorithm 5 while k = df; therefore, by the constraint in Step 



3, we know that either this quantity is at least as big as 
In particular, on this event, if we can show that 



2~'^fn 



, or else we have 2 ''f > df ■ 2"-. 



2'+ Yl 12' 



1 < 



2''^fn 



and 2*+^ <df'2 



(96) 



i=max< 1,1 J 



then it must be true that i < i^^. Next, we will focus on establishing this fact. 

Consider any i G |max |i, , . . . , min z| | and any m G {2' + 1, . . . , 2'+^}. If 



df = 1, then 



Af^) {X„„ W2,B (/, u)) > 27/3 W2 ] = (S'^f (B (/, h)) ) . 
Otherwise, if df > 1, then by Markov's inequality and the definition of A^^^ (■) ■) ■) from ( [T6l ). 



Af^^x^, 1^2, B (/, f,)) > 27/3 



27 



Wo 



(4m)3 



^F(si^'^u{x^}eS''f (B(/,f,)) 



5. 



{df) 



By Lemma[39l Lemma|59l and on J„((5) n i^i*^ n H^'\ this is at most 



=1 



5 



^3 23i+3 



4323i+3 



s=l 



S. 



(df) 



Note that this value is invariant to the choice of m G 1 2* + 1, . . . , 2*+^ } . By Hoeff ding's inequality, 
on an event J*{i) of probability P (Jnii)) > 1 - (5/(16i^), this is at most 



(97) 



Since i > ii > log2(f^/4) and n > ln(l/(5), we have 



4323j+3 



H4i/5) ^ / ln(41og2(n/4)/<^) ^ / ln(n/J) ^ ^-^^ 



128n 



128n 



Thus, dW] ) is at most 



24 



p(2- + p'^/(5'^/(B(/,f,)) 



In either case (df = 1 or df > 1), by definition of 6*/ (e« ), on J„(5) n H^!^ n i^i"'' n J*{i), 
Vm G {2* + 1, . . . , 2*+^} we have 



Ai;{\x^,W^2,B(/,f,)) >27/3 



Furthermore, the l[27/3.oo) ^^im^ {^m,W2,B (/, ?^j))J indicators are conditionally independent 

given W2, so that we may bound P > Q via a Chemoff bound. Toward this end, note 

that on J„((5) n Hi'^ n n J*(i), dlHll imphes 

2«+i . _ 

E[|Qml|^2]= J] pfAf^^ (X^, TVs, B (/, f,)) > 27/3 

.m=2»+l 

24 / ~ / 1 \ r 1 1 \ 24 

^(2- + ^,(..).max{f.,..})<^ 



W^2) < 1^(2-^ + ^/ (e^) -max {f„e^}). (98) 



< 2* • f2-* + •max|fi,e«|) < ^ (l + 61/ fe^ | • max 



Of (e^) •max{2^fi,2^e^}) . (99) 
Note that 

2V, = /ic« {di + ln(l/5))^ • 2'(^~^) 

- 1 j_ 1 

Then since 2~*2k-i < « . (8(ilog2 ^) ^""^ , we have that the rightmost expression in (|99l ) is 

at most 

(si) . . . Ai) , |i (1 . (si) . W . ?§) . .i-) , Q/. 

Therefore, a Chernoff bound implies that on Jn{S) n ^ n Hn^^ n J^(i), we have 
p(|Qi+i| > q|vF2) <exp{-Q/6} <exp|-81og2 (^) } 

< exp - log2 ^ ' < S/{8i). 



Combined with the law of total probability and a union bound over i values, this implies there 
exists an event J*{e, 6) C J„(5) n H^}^ n H^''^ with P (^Jn{5) n H^^ n \ JU^, 5)) < 

((5/(16i^) + (^/(8i)) < (5/4, on which every i G |max |i, , . . . , min , i| | has 

IQi+il < Q- 

We have chosen c* and C2 large enough that 2*^"^ < dj ■ 2" and 2* < In particular, 

this means that on Jj^(e, (5), 



2'+ |Qi+i| <2-^/-2n + iQ. 



«=max< I, It 



Furthermore, since i < 31og2 we have 



^ '^'^^fJ-c^d- ( 1 \ 2_2 , 2 4dc 



Combining the above, we have that ( |96l ) is satisfied on J*(e, 5), so that i^^ > i. Combined with 
Lemma [59l this implies that on J^(e, 5), 



V. C C C c 



«dj. ' \ y 2* 



+ ln(l/(5) V«-i 
and by definition of i we have 



^ ' ^ ' < cf 8(ilog2 — j -2 *2«-i 



< c ( 8dlog2 ^) • (e/c) ■ (sdlog^ ^) = e, 



so that V;. C C(e). 



Finally, to prove the stated bound on P( J*(e, 5)), we have 

1 - P ( 5)) < (1 - FiUm + (l - (^^^)) + IP {h^^ \ Hll"^) 
+ ¥[U5)r^H^'^r^H^')\Jl{e,5)) 
< 3(5/4 + c(*) • exp |-n3(5//8} + c^") • exp | -71(5^^120} < 5. 



Finally, we are ready for the proof of Lemma 26.
Proof [Lemma l26l First, note that because we break ties in the argmax of Step 7 in favor of a y 
value with Vi^,+i[{Xm,y)] / 0, if Viu+i / ^ before Step 8, then this remains true after Step 8. Fur- 
thermore, the f7jj.+i estimator is nonnegative, and thus the update in Step 10 never removes from 
Vij.+i the minimizer of er^ ^^(/i) among h G Vik+i- Therefore, by induction we have Vi^, ^ 

at all times in Algorithm 5. In particular, Vj^^^^^ 7^ so that the return classifier h exists. Also, 
by Lemma [6OI for n as in Lemma [6OI on Jj^(e, 5), running Algorithm 5 with label budget n and 
confidence parameter 5 results in V-- C C(e). Combining these two facts implies that for such a 

value of n, on J*(e, 5), heV-- , , C Vj_ C C(e), so that [h] < v + e. ■ 



E.3 The Misspecified Model Case

Here we present a proof of Theorem 28, including a specification of the method $\mathcal{A}_a'$ from the theorem statement.

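Proof [Theorem 28] Consider a weakly universally consistent passive learning algorithm $\mathcal{A}_u$ (Devroye, Györfi, and Lugosi, 1996). Such a method must exist in our setting; for instance, Hoeffding's inequality and a union bound imply that it suffices to take

$$\mathcal{A}_u(\mathcal{L}) = \operatorname*{argmin}_{\mathbb{1}^{\pm}_{B_i}} \; \mathrm{er}_{\mathcal{L}}\left(\mathbb{1}^{\pm}_{B_i}\right) + \sqrt{\frac{\ln(4 i^2 |\mathcal{L}|)}{2|\mathcal{L}|}},$$

where $\{B_1, B_2, \ldots\}$ is a countable algebra that generates $\mathcal{F}_X$.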

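Then $\mathcal{A}_u$ achieves a label complexity $\Lambda_u$ such that for any distribution $\mathcal{P}_{XY}$ on $\mathcal{X} \times \{-1,+1\}$, $\forall \varepsilon \in (0,1)$, $\Lambda_u(\varepsilon + \nu^*(\mathcal{P}_{XY}), \mathcal{P}_{XY}) < \infty$. In particular, if $\nu^*(\mathcal{P}_{XY}) < \nu(\mathbb{C};\mathcal{P}_{XY})$, then

$$\Lambda_u\big((\nu^*(\mathcal{P}_{XY}) + \nu(\mathbb{C};\mathcal{P}_{XY}))/2,\ \mathcal{P}_{XY}\big) < \infty.$$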

Fix any $n \in \mathbb{N}$, and describe the execution of $\mathcal{A}'_a(n)$ as follows. In a preprocessing step, withhold the first $m_{un} = n - \lfloor n/2 \rfloor - \lfloor n/3 \rfloor \geq n/6$ examples $\{X_1, \ldots, X_{m_{un}}\}$ and request their labels $\{Y_1, \ldots, Y_{m_{un}}\}$. Run $\mathcal{A}_a(\lfloor n/2 \rfloor)$ on the remainder of the sequence $\{X_{m_{un}+1}, X_{m_{un}+2}, \ldots\}$ (i.e., shift any index references in the algorithm by $m_{un}$), and let $h_a$ denote the classifier it returns. Also request the labels $Y_{m_{un}+1}, \ldots, Y_{m_{un}+\lfloor n/3 \rfloor}$, and let

$$h_u = \mathcal{A}_u\big(\{(X_{m_{un}+1}, Y_{m_{un}+1}), \ldots, (X_{m_{un}+\lfloor n/3 \rfloor}, Y_{m_{un}+\lfloor n/3 \rfloor})\}\big).$$

If $\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u) > n^{-1/3}$, return $\hat{h} = h_u$; otherwise, return $\hat{h} = h_a$. This method achieves the stated result, for the following reasons.
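
The selection procedure just described can be illustrated with a minimal Python sketch. The names `combined_learner`, `split_budget`, `request_label`, `active_learner`, and `passive_learner` are hypothetical placeholders introduced here (they are not part of the paper), and the two learners are treated as black boxes standing in for $\mathcal{A}_a$ and $\mathcal{A}_u$. The sketch only shows the budget split into $m_{un}$, $\lfloor n/2 \rfloor$, and $\lfloor n/3 \rfloor$ label requests and the held-out comparison against the threshold $n^{-1/3}$.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], int]   # maps a point to a label in {-1, +1}
LabelOracle = Callable[[int], int]    # maps an index i to the label of xs[i]


def empirical_error(h: Classifier, xs: Sequence[float], ys: Sequence[int]) -> float:
    """Fraction of the labeled points that h misclassifies."""
    return sum(1 for x, y in zip(xs, ys) if h(x) != y) / len(xs)


def split_budget(n: int) -> Tuple[int, int, int]:
    """Return (m_un, floor(n/2), floor(n/3)); the three parts sum to n."""
    return n - n // 2 - n // 3, n // 2, n // 3


def combined_learner(
    n: int,
    xs: Sequence[float],
    request_label: LabelOracle,
    active_learner: Callable[[Sequence[float], LabelOracle, int], Classifier],
    passive_learner: Callable[[List[Tuple[float, int]]], Classifier],
) -> Classifier:
    """Hypothetical wrapper: run both learners, keep the one a hold-out set favors."""
    m_un, n_active, n_passive = split_budget(n)

    # Preprocessing: withhold the first m_un points and request their labels.
    held_x = list(xs[:m_un])
    held_y = [request_label(i) for i in range(m_un)]

    # Run the active learner on the remainder, shifting index references by m_un.
    def shifted_oracle(i: int) -> int:
        return request_label(m_un + i)

    h_a = active_learner(xs[m_un:], shifted_oracle, n_active)

    # Label the next floor(n/3) points and train the passive learner on them.
    train = [(xs[m_un + i], request_label(m_un + i)) for i in range(n_passive)]
    h_u = passive_learner(train)

    # Return h_u only if it beats h_a on the hold-out set by more than n^(-1/3).
    gap = empirical_error(h_a, held_x, held_y) - empirical_error(h_u, held_x, held_y)
    return h_u if gap > n ** (-1.0 / 3.0) else h_a
```

The asymmetric test means the wrapper defaults to the active learner's output and switches to the passive one only when the hold-out evidence clearly favors it.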

First, let us examine the final step of this algorithm. By Hoeffding's inequality, with probability at least $1 - 2 \cdot \exp\{-n^{1/3}/12\}$,

$$\left| \big(\mathrm{er}_{m_{un}}(h_a) - \mathrm{er}_{m_{un}}(h_u)\big) - \big(\mathrm{er}(h_a) - \mathrm{er}(h_u)\big) \right| \leq n^{-1/3}.$$

When this is the case, a triangle inequality implies $\mathrm{er}(\hat{h}) \leq \min\{\mathrm{er}(h_a), \mathrm{er}(h_u) + 2n^{-1/3}\}$.
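If $\mathcal{P}_{XY}$ satisfies the benign noise case, then for any

$$n \geq 2\Lambda_a(\varepsilon/2 + \nu(\mathbb{C};\mathcal{P}_{XY}), \mathcal{P}_{XY}),$$

we have $\mathbb{E}[\mathrm{er}(h_a)] \leq \nu(\mathbb{C};\mathcal{P}_{XY}) + \varepsilon/2$, so $\mathbb{E}[\mathrm{er}(\hat{h})] \leq \nu(\mathbb{C};\mathcal{P}_{XY}) + \varepsilon/2 + 2 \cdot \exp\{-n^{1/3}/12\}$, which is at most $\nu(\mathbb{C};\mathcal{P}_{XY}) + \varepsilon$ if $n \geq 12^3 \ln^3(4/\varepsilon)$. So in this case, we can take $\lambda(\varepsilon) = \lceil 12^3 \ln^3(4/\varepsilon) \rceil$.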
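
As a numerical sanity check of the constants in the benign noise case, the following Python sketch evaluates the Hoeffding tail for a hold-out set of size $n/6$ with deviation $n^{-1/3}$, and confirms that the choice $\lceil 12^3 \ln^3(4/\varepsilon) \rceil$ makes the failure term at most $\varepsilon/2$. The function names are hypothetical, and the range width of 2 for the difference of two error indicators is an assumption made for this illustration.

```python
import math


def hoeffding_two_sided(m: float, t: float, width: float = 2.0) -> float:
    """Two-sided Hoeffding bound for the mean of m i.i.d. variables whose
    range has the given width (assumed 2 for a difference of two indicators)."""
    return 2.0 * math.exp(-2.0 * m * t * t / (width * width))


def selection_failure_term(n: int) -> float:
    """The term 2 * exp(-n^(1/3) / 12) appearing in the proof."""
    return 2.0 * math.exp(-(n ** (1.0 / 3.0)) / 12.0)


def benign_lambda(eps: float) -> int:
    """The threshold ceil(12^3 * ln^3(4 / eps)) from the benign noise case."""
    return math.ceil(12 ** 3 * math.log(4.0 / eps) ** 3)


if __name__ == "__main__":
    n = 6000
    # With m = n/6 hold-out points and deviation n^(-1/3), both expressions agree.
    print(hoeffding_two_sided(n / 6, n ** (-1.0 / 3.0)), selection_failure_term(n))
    for eps in (0.5, 0.1, 0.01):
        n_eps = benign_lambda(eps)
        print(eps, n_eps, selection_failure_term(n_eps) <= eps / 2 * (1 + 1e-9))
```

The small relative tolerance in the final comparison only guards against floating-point rounding in the cube root.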

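On the other hand, if $\mathcal{P}_{XY}$ is not in the benign noise case (i.e., the misspecified model case), then for any $n \geq 3\Lambda_u((\nu^*(\mathcal{P}_{XY}) + \nu(\mathbb{C};\mathcal{P}_{XY}))/2, \mathcal{P}_{XY})$, $\mathbb{E}[\mathrm{er}(h_u)] \leq (\nu^*(\mathcal{P}_{XY}) + \nu(\mathbb{C};\mathcal{P}_{XY}))/2$, so that

$$\mathbb{E}[\mathrm{er}(\hat{h})] \leq \mathbb{E}[\mathrm{er}(h_u)] + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\} \leq (\nu^*(\mathcal{P}_{XY}) + \nu(\mathbb{C};\mathcal{P}_{XY}))/2 + 2n^{-1/3} + 2 \cdot \exp\{-n^{1/3}/12\}.$$

Again, this is at most $\nu(\mathbb{C};\mathcal{P}_{XY}) + \varepsilon$ if $n \geq \max\{12^3 \ln^3(2/\varepsilon),\ 64(\nu(\mathbb{C};\mathcal{P}_{XY}) - \nu^*(\mathcal{P}_{XY}))^{-3}\}$. So in this case, we can take

$$\lambda(\varepsilon) = \left\lceil \max\left\{ 12^3 \ln^3 \frac{2}{\varepsilon},\ 3\Lambda_u\!\left(\frac{\nu^*(\mathcal{P}_{XY}) + \nu(\mathbb{C};\mathcal{P}_{XY})}{2}, \mathcal{P}_{XY}\right),\ \frac{64}{(\nu(\mathbb{C};\mathcal{P}_{XY}) - \nu^*(\mathcal{P}_{XY}))^{3}} \right\} \right\rceil.$$

In either case, we have $\lambda(\varepsilon) \in \mathrm{Polylog}(1/\varepsilon)$.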
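
To see how the three terms in the misspecified-case threshold interact, here is a small Python sketch. The function names and the argument `passive_label_complexity` (a stand-in for the distribution-dependent, generally unknown value $\Lambda_u((\nu^* + \nu)/2, \mathcal{P}_{XY})$) are hypothetical and introduced only for illustration. The check confirms that at the resulting $n$, the slack $2n^{-1/3} + 2\exp\{-n^{1/3}/12\}$ is at most $(\nu - \nu^*)/2 + \varepsilon$, which is what the argument requires.

```python
import math


def misspecified_lambda(eps: float, nu_c: float, nu_star: float,
                        passive_label_complexity: float) -> int:
    """ceil(max{12^3 ln^3(2/eps), 3 * Lambda_u value, 64 / (nu_c - nu_star)^3}).

    passive_label_complexity is an assumed, externally supplied number.
    """
    gap = nu_c - nu_star
    if gap <= 0:
        raise ValueError("misspecified case requires nu_star < nu_c")
    return math.ceil(max(12 ** 3 * math.log(2.0 / eps) ** 3,
                         3.0 * passive_label_complexity,
                         64.0 / gap ** 3))


def selection_slack(n: int) -> float:
    """The additive slack 2 n^(-1/3) + 2 exp(-n^(1/3) / 12) from the proof."""
    root = n ** (1.0 / 3.0)
    return 2.0 / root + 2.0 * math.exp(-root / 12.0)


if __name__ == "__main__":
    eps, nu_c, nu_star = 0.1, 0.3, 0.1
    n = misspecified_lambda(eps, nu_c, nu_star, passive_label_complexity=1000)
    # The slack must fit within half the gap plus eps.
    print(n, selection_slack(n) <= (nu_c - nu_star) / 2 + eps + 1e-9)
```

For fixed $\nu$ and $\nu^*$, only the first term depends on $\varepsilon$, which is why the threshold grows polylogarithmically in $1/\varepsilon$.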